GPUStack Setup Guide (2026): Open-Source GPU Cluster Manager for Local LLMs
GPUStack is the open-source GPU cluster manager that fills an awkward gap between "Ollama on one machine" and "full-blown KServe in Kubernetes." If you have 2-10 GPU machines — maybe a mix of NVIDIA workstations, an AMD box, a Mac Studio, and a few cloud instances — and you want to treat them as one LLM cluster with a unified OpenAI API and automatic model placement, GPUStack is the simplest way to do it.
This guide covers everything: the architecture (server + workers), installation across Linux / Mac / Windows / Docker / Kubernetes, GPU auto-discovery on heterogeneous hardware, model deployment with backend selection, replica placement, the OpenAI-compatible gateway, RBAC and API keys, monitoring, and tuning recipes for common cluster shapes.
Table of Contents
- What GPUStack Is
- Architecture: Server + Workers + Gateway
- Hardware Coverage
- Installation: Server Node
- Adding Worker Nodes
- Docker / Kubernetes Deployment
- Deploying Your First Model
- Backends: vLLM, llama.cpp, Ascend MindIE, Vox-box
- Replicas, Placement, and Failover
- The Unified OpenAI Gateway
- Authentication and RBAC
- Monitoring and Metrics
- Heterogeneous Cluster Examples
- GPUStack vs KServe vs Triton
- Troubleshooting
- FAQ
What GPUStack Is {#what-it-is}
GPUStack is a central control plane plus per-node worker agents that:
- Discover GPUs on every worker node (NVIDIA, AMD, Apple, Ascend, DCU, Intel)
- Schedule models onto suitable GPUs across nodes
- Run heterogeneous backends (vLLM, llama.cpp, ascend-mindie, vox-box) per model
- Expose a unified OpenAI-compatible API gateway
- Provide a web UI for cluster + model management
- Handle authentication, API keys, and per-key rate limits
Project: github.com/gpustack/gpustack. Apache 2.0 licensed. Developed by Seal.
Architecture: Server + Workers + Gateway {#architecture}
┌─────────────────┐
│ GPUStack UI │
│ (Browser) │
└────────┬────────┘
│
▼
┌─────────────────────────┐
│ GPUStack Server │
│ (Scheduler + Gateway) │
│ - PostgreSQL state │
│ - OpenAI API gateway │
└────┬─────────┬────┬─────┘
│ │ │
┌────────────┘ │ └────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Worker (NVIDIA) │ │ Worker (Mac) │ │ Worker (AMD) │
│ vLLM / llama.cpp│ │ llama.cpp │ │ vLLM / llama.cpp│
│ RTX 4090 x2 │ │ M4 Max 64GB │ │ RX 7900 XTX │
└─────────────────┘ └─────────────────┘ └─────────────────┘
The server runs the scheduler, the gateway, the web UI, and a small embedded database for state (SQLite by default; external PostgreSQL for production). Workers run on each GPU machine and launch the model backends.
Hardware Coverage {#hardware}
| Vendor | Backends | Notes |
|---|---|---|
| NVIDIA | vLLM, llama.cpp | CUDA 11.8+ |
| AMD | vLLM-rocm, llama.cpp HIP | ROCm 6.x; RX 7900-series, MI300X |
| Apple | llama.cpp Metal | M1+ |
| Huawei Ascend | mindie-llm | 910B / 310 / Atlas |
| Hygon DCU | dtk-llm | Z100 / K100 |
| Intel Arc | llama.cpp Vulkan / OpenVINO | Beta |
| CPU | llama.cpp | Fallback |
Ascend / DCU support is rare among open-source LLM cluster managers and a key reason GPUStack is popular for deployments in China and elsewhere in Asia, including government environments.
Installation: Server Node {#install-server}
# Linux / macOS
curl -sfL https://get.gpustack.ai | sh -
# Start the server (also acts as the first worker)
gpustack start
# Get the server's bootstrap token for adding workers
cat /var/lib/gpustack/token
Default ports: 80 (UI + API), 10150 (worker registration).
Browse to http://<server-ip>. Default login: admin / password from /var/lib/gpustack/initial_admin_password.
Windows
Download gpustack-installer.exe from GitHub releases and run. Same defaults.
Docker
docker run -d --gpus all \
--name gpustack-server \
--restart unless-stopped \
-p 80:80 \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack:latest
Adding Worker Nodes {#install-workers}
On each additional GPU machine:
curl -sfL https://get.gpustack.ai | sh -
gpustack start \
--server-url http://<server-ip> \
--token <bootstrap-token>
The worker registers, its GPUs are discovered, and it appears in the UI under Resources → Workers.
For Docker workers:
docker run -d --gpus all \
--name gpustack-worker \
--restart unless-stopped \
-e GPUSTACK_SERVER_URL=http://<server-ip> \
-e GPUSTACK_TOKEN=<token> \
gpustack/gpustack:latest worker
Docker / Kubernetes Deployment {#docker-k8s}
For Kubernetes:
helm repo add gpustack https://gpustack.github.io/helm-charts/
helm install gpustack gpustack/gpustack \
--namespace gpustack --create-namespace \
--set server.persistence.size=20Gi \
--set worker.daemonset.enabled=true
The chart deploys the server as a StatefulSet and workers as a DaemonSet on all nodes labeled gpustack.io/worker=true. Add the label to GPU nodes:
kubectl label node gpu-node-1 gpustack.io/worker=true
Deploying Your First Model {#first-model}
Via UI: Models → Deploy Model → enter Hugging Face ID (e.g., Qwen/Qwen2.5-7B-Instruct) → click Deploy.
Via API:
curl http://<server>/v1/models \
-H "Authorization: Bearer <api-key>" \
-H "Content-Type: application/json" \
-d '{
"name": "qwen2.5-7b",
"source": "huggingface",
"huggingface_repo_id": "Qwen/Qwen2.5-7B-Instruct",
"backend": "vllm",
"replicas": 2,
"gpu_count": 1
}'
GPUStack picks 2 GPUs across the cluster with enough VRAM, pulls the weights once per node, and starts vLLM containers.
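Once the deploy call returns, you can script a quick verification against the OpenAI-compatible endpoint. A minimal sketch using the openai Python package, reusing the <server> and <api-key> placeholders from above; it assumes the gateway exposes the standard GET /v1/models listing:
from openai import OpenAI

# Point a standard OpenAI client at the GPUStack gateway.
client = OpenAI(base_url="http://<server>/v1", api_key="<api-key>")

# The deployed model should appear once at least one replica is healthy.
for model in client.models.list():
    print(model.id)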
Backends: vLLM, llama.cpp, Ascend MindIE, Vox-box {#backends}
| Backend | When |
|---|---|
| vLLM | NVIDIA / AMD ROCm; high-throughput serving |
| llama.cpp | Mac, AMD Vulkan, CPU, GGUF format |
| mindie-llm | Huawei Ascend NPUs |
| dtk-llm | Hygon DCU |
| vox-box | Audio (Whisper, TTS) |
| diffusers | Image generation |
You can pin a backend per model or let GPUStack auto-select based on hardware availability. Mixed-backend deployments are common: same model name, vLLM replicas on NVIDIA, llama.cpp replicas on Mac.
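Pinning is a single field on the deploy payload. Below is a sketch that mirrors the earlier curl deploy, here forcing llama.cpp for a GGUF build; the repo ID and model name are illustrative, and the backend string follows the table above:
import requests

# Same deploy API as the curl example, with the backend pinned instead
# of auto-selected. Repo ID and model name are illustrative.
payload = {
    "name": "qwen2.5-7b-gguf",
    "source": "huggingface",
    "huggingface_repo_id": "Qwen/Qwen2.5-7B-Instruct-GGUF",
    "backend": "llama.cpp",
    "replicas": 1,
    "gpu_count": 1,
}
resp = requests.post(
    "http://<server>/v1/models",
    headers={"Authorization": "Bearer <api-key>"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())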
Replicas, Placement, and Failover {#replicas}
Set replicas: N on a model to deploy N independent copies. The scheduler:
- Filters workers by required hardware (gpu_count, gpu_vendor, VRAM).
- Sorts by current free VRAM and load.
- Places replicas to maximize fault tolerance (different nodes when possible).
- Health-checks replicas every 10s; replaces failed ones.
Failover: if a worker drops, the gateway routes new requests to remaining replicas. In-flight requests on the failed replica fail (no transparent retry).
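Since a dropped worker fails any requests in flight on it, clients should bring their own retry for idempotent calls. A minimal, non-GPUStack-specific sketch with the openai package:
import time
from openai import OpenAI, APIConnectionError, APIStatusError

client = OpenAI(base_url="http://<server>/v1", api_key="<api-key>")

def chat_with_retry(messages, retries=3, backoff=1.0):
    # Connection errors and 5xx are what a mid-request replica failure
    # surfaces as; 4xx client errors propagate unchanged.
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="qwen2.5-7b", messages=messages
            )
        except APIStatusError as err:
            if err.status_code < 500 or attempt == retries - 1:
                raise
        except APIConnectionError:
            if attempt == retries - 1:
                raise
        time.sleep(backoff * (2 ** attempt))  # exponential backoff

reply = chat_with_retry([{"role": "user", "content": "hi"}])
print(reply.choices[0].message.content)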
The Unified OpenAI Gateway {#gateway}
curl http://<server>/v1/chat/completions \
-H "Authorization: Bearer <api-key>" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-7b",
"messages": [{"role":"user","content":"hi"}]
}'
The gateway:
- Routes requests to a healthy replica with capacity
- Tracks per-replica request counts for load balancing
- Streams SSE / chunked responses through to the client
- Adds usage tracking per API key
- Honors per-key rate limits
OpenAI parity: chat completions, completions, embeddings, image generation (with diffusers backend), audio (with vox-box backend), reranker.
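Because the gateway speaks the standard OpenAI wire format, any OpenAI SDK works by swapping the base URL. The same call as the curl above, from Python with streaming enabled:
from openai import OpenAI

client = OpenAI(base_url="http://<server>/v1", api_key="<api-key>")

# stream=True exercises the SSE pass-through described above.
stream = client.chat.completions.create(
    model="qwen2.5-7b",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()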
Authentication and RBAC {#auth}
UI: Settings → API Keys → Create. Set scope (model allow-list), rate limit, expiration.
CLI:
gpustack key create --name dev-team --scope "qwen2.5-7b,nomic-embed" \
--rate-limit-rpm 100 --expires 30d
Roles: admin (everything), user (deploy own models, use shared models). LDAP / OIDC integration is in tech preview.
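When a key exceeds its rate limit, the usual convention is an HTTP 429 response, possibly with a Retry-After header; treat that shape as an assumption rather than documented GPUStack behavior. A defensive client sketch:
import time
import requests

def post_with_backoff(url, api_key, payload):
    # Wait out per-key rate limits. 429 + Retry-After is conventional,
    # not confirmed GPUStack behavior.
    while True:
        r = requests.post(
            url,
            json=payload,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=60,
        )
        if r.status_code != 429:
            r.raise_for_status()
            return r.json()
        time.sleep(float(r.headers.get("Retry-After", "1")))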
Monitoring and Metrics {#monitoring}
Built-in dashboard shows per-GPU utilization, per-model TPS, per-API-key request counts. Prometheus scrape on /metrics:
| Metric | Meaning |
|---|---|
| gpustack_gpu_memory_used_bytes | Per-GPU VRAM used |
| gpustack_gpu_utilization_percent | GPU compute utilization |
| gpustack_model_requests_total | Per-model request count |
| gpustack_model_active_requests | In-flight requests per model |
| gpustack_api_key_requests_total | Per-key request count |
Pair with Grafana for dashboards.
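For a quick look without Grafana, the exposition format parses directly in Python. A sketch using requests and the prometheus_client parser, with a metric name taken from the table above:
import requests
from prometheus_client.parser import text_string_to_metric_families

# Scrape the server's Prometheus endpoint and print per-GPU VRAM use.
raw = requests.get("http://<server>/metrics", timeout=10).text
for family in text_string_to_metric_families(raw):
    if family.name == "gpustack_gpu_memory_used_bytes":
        for sample in family.samples:
            print(sample.labels, sample.value)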
Heterogeneous Cluster Examples {#heterogeneous}
Example 1: Solo developer with workstation + Mac
- Server + worker on RTX 4090 desktop (Linux)
- Worker on M4 Max MacBook Pro (Mac)
- Models: vLLM Qwen 2.5 7B on the 4090 (high throughput), llama.cpp Llama 3.3 70B on the Mac (large model)
Example 2: Small team
- 1 server + 4 workers in a homelab
- 2x RTX 4090 (NVIDIA workers)
- 1x RX 7900 XTX (AMD worker)
- 1x Mac Studio M4 Max 128GB (Mac worker)
- Models: 70B on the Mac, 32B AWQ replicated across 4090s, 8B on the 7900 XTX for embedding workloads
Example 3: Air-gapped enterprise cluster
- 1 server + 8 workers on 8x H100 nodes
- vLLM Llama 3.1 405B FP8 with TP=8, replicated 2x for HA
- vLLM Qwen 2.5 7B AWQ on smaller A6000 nodes
- llama.cpp Whisper for transcription
- All with Ascend / DCU fallback nodes for sovereignty requirements
GPUStack vs KServe vs Triton {#vs-kserve}
| Property | GPUStack | KServe | Triton |
|---|---|---|---|
| Deployment model | Standalone or K8s | K8s only | Standalone or K8s |
| LLM-specific | Yes | No (general) | Partial |
| Heterogeneous hardware | Yes | No | Limited |
| Backends | vLLM, llama.cpp, mindie, dtk | Any (custom) | TF, PyTorch, TRT-LLM, ONNX |
| OpenAI gateway built-in | Yes | No (custom) | Via TGI |
| Web UI | Yes | No (Knative dashboard) | No |
| Multi-tenant API keys | Yes | Via Istio | Via Triton+gateway |
| Production maturity | Growing | Mature | Mature |
For a Kubernetes-native, audit-heavy production AI platform, KServe + vLLM is the canonical stack. For an opinionated all-in-one LLM cluster manager that runs on whatever hardware you have, GPUStack is the better fit.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Worker won't register | Token expired | gpustack token rotate and re-add |
| Model stuck in "Starting" | OOM at backend init | Lower replicas or use smaller quant |
| GPU not detected | Driver / runtime missing | Install nvidia-container-toolkit / rocm |
| Gateway 503 | All replicas unhealthy | Check worker logs in UI |
| Slow first request | Container cold start | Pre-warm with periodic health pings (see the sketch after this table) |
| Mac worker disconnects | Network sleep | Disable sleep on Mac worker |
| Ascend / DCU not detected | Vendor toolkit missing | Install the vendor SDK before worker start |
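For the cold-start row, a keep-warm loop is enough: send a one-token request to each model on a timer. A sketch (model list and interval are illustrative):
import time
from openai import OpenAI

client = OpenAI(base_url="http://<server>/v1", api_key="<api-key>")
MODELS = ["qwen2.5-7b"]  # whichever models need warm starts

while True:
    for name in MODELS:
        try:
            client.chat.completions.create(
                model=name,
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
            )
        except Exception as exc:  # keep the loop alive on transient errors
            print(f"warm ping failed for {name}: {exc}")
    time.sleep(300)  # every 5 minutes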
FAQ {#faq}
See answers to common GPUStack questions below.
Sources: GPUStack GitHub | GPUStack docs | Internal benchmarks across NVIDIA, AMD, Apple, Ascend.