
GPUStack Setup Guide (2026): Open-Source GPU Cluster Manager for Local LLMs

May 1, 2026
24 min read
LocalAimaster Research Team


GPUStack is the open-source GPU cluster manager that fills an awkward gap between "Ollama on one machine" and "full-blown KServe in Kubernetes." If you have 2-10 GPU machines — maybe a mix of NVIDIA workstations, an AMD box, a Mac Studio, and a few cloud instances — and you want to treat them as one LLM cluster with a unified OpenAI API and automatic model placement, GPUStack is the simplest way to do it.

This guide covers everything: the architecture (server + workers), installation across Linux / Mac / Windows / Docker / Kubernetes, GPU auto-discovery on heterogeneous hardware, model deployment with backend selection, replica placement, the OpenAI-compatible gateway, RBAC and API keys, monitoring, and tuning recipes for common cluster shapes.

Table of Contents

  1. What GPUStack Is
  2. Architecture: Server + Workers + Gateway
  3. Hardware Coverage
  4. Installation: Server Node
  5. Adding Worker Nodes
  6. Docker / Kubernetes Deployment
  7. Deploying Your First Model
  8. Backends: vLLM, llama.cpp, Ascend MindIE, Vox-box
  9. Replicas, Placement, and Failover
  10. The Unified OpenAI Gateway
  11. Authentication and RBAC
  12. Monitoring and Metrics
  13. Heterogeneous Cluster Examples
  14. GPUStack vs KServe vs Triton
  15. Troubleshooting
  16. FAQ


What GPUStack Is {#what-it-is}

GPUStack pairs a control-plane server with per-node worker agents that:

  • Discover GPUs on every worker node (NVIDIA, AMD, Apple, Ascend, DCU, Intel)
  • Schedule models onto suitable GPUs across nodes
  • Run heterogeneous backends (vLLM, llama.cpp, ascend-mindie, vox-box) per model
  • Expose a unified OpenAI-compatible API gateway
  • Provide a web UI for cluster + model management
  • Handle authentication, API keys, and per-key rate limits

Project: github.com/gpustack/gpustack, Apache 2.0 licensed.


Architecture: Server + Workers + Gateway {#architecture}

                      ┌─────────────────┐
                      │   GPUStack UI   │
                      │   (Browser)      │
                      └────────┬────────┘
                               │
                               ▼
                      ┌─────────────────────────┐
                      │   GPUStack Server       │
                      │  (Scheduler + Gateway)  │
                      │  - PostgreSQL state     │
                      │  - OpenAI API gateway   │
                      └────┬─────────┬────┬─────┘
                           │         │    │
              ┌────────────┘         │    └────────────┐
              │                      │                  │
              ▼                      ▼                  ▼
     ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
     │ Worker (NVIDIA) │  │  Worker (Mac)   │  │  Worker (AMD)   │
     │ vLLM / llama.cpp│  │   llama.cpp     │  │ vLLM / llama.cpp│
     │  RTX 4090 x2    │  │   M4 Max 64GB   │  │   RX 7900 XTX   │
     └─────────────────┘  └─────────────────┘  └─────────────────┘

The server runs the scheduler, the gateway, the web UI, and a small embedded Postgres for state. Workers run on each GPU machine and execute the model containers.


Hardware Coverage {#hardware}

| Vendor | Backends | Notes |
|---|---|---|
| NVIDIA | vLLM, llama.cpp, TGI | CUDA 11.8+ |
| AMD | vLLM-rocm, llama.cpp HIP | ROCm 6.x; RX 7900-series, MI300X |
| Apple | llama.cpp Metal | M1+ |
| Huawei Ascend | mindie-llm | 910B / 310 / Atlas |
| Hygon DCU | dtk-llm | Z100 / K100 |
| Intel Arc | llama.cpp Vulkan / OpenVINO | Beta |
| CPU | llama.cpp | Fallback |

The Ascend / DCU support is unique among open-source LLM cluster managers and a key reason GPUStack is popular in deployments across Asia, particularly in Chinese government environments.



Installation: Server Node {#install-server}

# Linux / macOS
curl -sfL https://get.gpustack.ai | sh -

# Start the server (also acts as the first worker)
gpustack start

# Get the server's bootstrap token for adding workers
cat /var/lib/gpustack/token

Default ports: 80 (UI + API), 10150 (worker registration).

Browse to http://<server-ip>. Default login: admin / password from /var/lib/gpustack/initial_admin_password.
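
Before adding workers, a quick reachability check from another machine confirms the UI and API port are answering. A minimal sketch using plain curl; it only verifies that the server responds on port 80:

# Should return an HTTP status line from the GPUStack server
curl -I http://<server-ip>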

Windows

Download gpustack-installer.exe from the GitHub releases page and run it. The same defaults apply.

Docker

docker run -d --gpus all \
    --name gpustack-server \
    --restart unless-stopped \
    -p 80:80 \
    -v gpustack-data:/var/lib/gpustack \
    gpustack/gpustack:latest
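
If the container comes up but the UI doesn't, the server logs are the first place to look. Standard Docker commands; the container name and data path match the run command and the note above:

# Follow server logs to confirm startup
docker logs -f gpustack-server

# Retrieve the initial admin password from inside the container
docker exec gpustack-server cat /var/lib/gpustack/initial_admin_password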

Adding Worker Nodes {#install-workers}

On each additional GPU machine:

curl -sfL https://get.gpustack.ai | sh -

gpustack start \
    --server-url http://<server-ip> \
    --token <bootstrap-token>

The worker registers, its GPUs are discovered, and the node appears in the UI under Resources → Workers.

For Docker workers:

docker run -d --gpus all \
    --name gpustack-worker \
    --restart unless-stopped \
    -e GPUSTACK_SERVER_URL=http://<server-ip> \
    -e GPUSTACK_TOKEN=<token> \
    gpustack/gpustack:latest worker
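
As with the server, the worker's logs show whether it reached the server and registered its GPUs. Same container name as in the run command above:

# Worker-side logs: look for successful registration and GPU discovery
docker logs -f gpustack-worker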

Docker / Kubernetes Deployment {#docker-k8s}

For Kubernetes:

helm repo add gpustack https://gpustack.github.io/helm-charts/
helm install gpustack gpustack/gpustack \
    --namespace gpustack --create-namespace \
    --set server.persistence.size=20Gi \
    --set worker.daemonset.enabled=true

The chart deploys the server as a StatefulSet and workers as a DaemonSet on all nodes labeled gpustack.io/worker=true. Add the label to GPU nodes:

kubectl label node gpu-node-1 gpustack.io/worker=true
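
For repeatable installs, the same options can live in a values file instead of --set flags. A hedged sketch: the keys below simply mirror the flags shown above and assume the chart nests them this way:

cat > gpustack-values.yaml <<'EOF'
server:
  persistence:
    size: 20Gi
worker:
  daemonset:
    enabled: true
EOF

helm upgrade --install gpustack gpustack/gpustack \
    --namespace gpustack --create-namespace \
    -f gpustack-values.yaml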

Deploying Your First Model {#first-model}

Via UI: Models → Deploy Model → enter Hugging Face ID (e.g., Qwen/Qwen2.5-7B-Instruct) → click Deploy.

Via API:

curl http://<server>/v1/models \
    -H "Authorization: Bearer <api-key>" \
    -H "Content-Type: application/json" \
    -d '{
        "name": "qwen2.5-7b",
        "source": "huggingface",
        "huggingface_repo_id": "Qwen/Qwen2.5-7B-Instruct",
        "backend": "vllm",
        "replicas": 2,
        "gpu_count": 1
    }'

GPUStack picks 2 GPUs across the cluster with enough VRAM, pulls the weights once per node, and starts vLLM containers.
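
To watch the rollout from the command line, you can query the models endpoint again. A hedged sketch: it assumes the same /v1/models path used for deployment also supports GET for listing, which may differ between GPUStack versions:

# List deployed models and their status (GET on the deploy endpoint is an assumption)
curl http://<server>/v1/models \
    -H "Authorization: Bearer <api-key>"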


Backends: vLLM, llama.cpp, Ascend MindIE, Vox-box {#backends}

| Backend | When |
|---|---|
| vLLM | NVIDIA / AMD ROCm; high-throughput serving |
| llama.cpp | Mac, AMD Vulkan, CPU, GGUF format |
| mindie-llm | Huawei Ascend NPUs |
| dtk-llm | Hygon DCU |
| vox-box | Audio (Whisper, TTS) |
| diffusers | Image generation |

You can pin a backend per model or let GPUStack auto-select based on hardware availability. Mixed-backend deployments are common: same model name, vLLM replicas on NVIDIA, llama.cpp replicas on Mac.
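
Pinning a backend is just another field in the deploy request. A sketch reusing the API shape from the deployment example above; the GGUF repo ID and exact values are illustrative:

curl http://<server>/v1/models \
    -H "Authorization: Bearer <api-key>" \
    -H "Content-Type: application/json" \
    -d '{
        "name": "qwen2.5-7b-gguf",
        "source": "huggingface",
        "huggingface_repo_id": "Qwen/Qwen2.5-7B-Instruct-GGUF",
        "backend": "llama.cpp",
        "replicas": 1,
        "gpu_count": 1
    }'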


Replicas, Placement, and Failover {#replicas}

Set replicas: N on a model to deploy N independent copies. The scheduler:

  1. Filters workers by required hardware (gpu_count, gpu_vendor, VRAM).
  2. Sorts by current free VRAM and load.
  3. Places replicas to maximize fault tolerance (different nodes when possible).
  4. Health-checks replicas every 10s; replaces failed ones.

Failover: if a worker drops, the gateway routes new requests to remaining replicas. In-flight requests on the failed replica fail (no transparent retry).


The Unified OpenAI Gateway {#gateway}

curl http://<server>/v1/chat/completions \
    -H "Authorization: Bearer <api-key>" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen2.5-7b",
        "messages": [{"role":"user","content":"hi"}]
    }'

The gateway:

  • Routes requests to a healthy replica with capacity
  • Tracks per-replica request counts for load balancing
  • Streams SSE / chunked responses through to the client
  • Adds usage tracking per API key
  • Honors per-key rate limits

OpenAI parity: chat completions, completions, embeddings, image generation (with diffusers backend), audio (with vox-box backend), reranker.
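
Streaming works the same way as with the OpenAI API: add "stream": true and the gateway relays the backend's SSE chunks. A minimal sketch against the same endpoint as above (-N disables curl's output buffering):

curl -N http://<server>/v1/chat/completions \
    -H "Authorization: Bearer <api-key>" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen2.5-7b",
        "stream": true,
        "messages": [{"role":"user","content":"Write a haiku about GPUs"}]
    }'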


Authentication and RBAC {#auth}

UI: Settings → API Keys → Create. Set scope (model allow-list), rate limit, expiration.

CLI:

gpustack key create --name dev-team --scope "qwen2.5-7b,nomic-embed" \
    --rate-limit-rpm 100 --expires 30d

Roles: admin (everything), user (deploy own models, use shared models). LDAP / OIDC integration is in tech preview.


Monitoring and Metrics {#monitoring}

Built-in dashboard shows per-GPU utilization, per-model TPS, per-API-key request counts. Prometheus scrape on /metrics:

| Metric | Meaning |
|---|---|
| gpustack_gpu_memory_used_bytes | Per-GPU VRAM used |
| gpustack_gpu_utilization_percent | GPU compute utilization |
| gpustack_model_requests_total | Per-model request count |
| gpustack_model_active_requests | In-flight requests per model |
| gpustack_api_key_requests_total | Per-key request count |

Pair with Grafana for dashboards.
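
Before wiring up a Prometheus scrape job, a quick curl confirms the exporter answers and the metric names match. A minimal sketch using the VRAM metric from the table above:

# Spot-check the metrics endpoint for per-GPU VRAM usage
curl -s http://<server>/metrics | grep gpustack_gpu_memory_used_bytes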


Heterogeneous Cluster Examples {#heterogeneous}

Example 1: Solo developer with workstation + Mac

  • Server + worker on RTX 4090 desktop (Linux)
  • Worker on M4 Max MacBook Pro (Mac)
  • Models: vLLM Qwen 2.5 7B on the 4090 (high throughput), llama.cpp Llama 3.3 70B on the Mac (large model)

Example 2: Small team

  • 1 server + 4 workers in a homelab
  • 2x RTX 4090 (NVIDIA workers)
  • 1x RX 7900 XTX (AMD worker)
  • 1x Mac Studio M4 Max 128GB (Mac worker)
  • Models: 70B on the Mac, 32B AWQ replicated across 4090s, 8B on the 7900 XTX for embedding workloads

Example 3: Air-gapped enterprise cluster

  • 1 server + 8 workers on 8x H100 nodes
  • vLLM Llama 3.1 405B FP8 with TP=8, replicated 2x for HA
  • vLLM Qwen 2.5 7B AWQ on smaller A6000 nodes
  • llama.cpp Whisper for transcription
  • All with Ascend / DCU fallback nodes for sovereignty requirements

GPUStack vs KServe vs Triton {#vs-kserve}

| Property | GPUStack | KServe | Triton |
|---|---|---|---|
| Deployment model | Standalone or K8s | K8s only | Standalone or K8s |
| LLM-specific | Yes | No (general) | Partial |
| Heterogeneous hardware | Yes | No | Limited |
| Backends | vLLM, llama.cpp, mindie, dtk | Any (custom) | TF, PyTorch, TRT-LLM, ONNX |
| OpenAI gateway built-in | Yes | No (custom) | Via TGI |
| Web UI | Yes | No (Knative dashboard) | No |
| Multi-tenant API keys | Yes | Via Istio | Via Triton + gateway |
| Production maturity | Growing | Mature | Mature |

For a Kubernetes-native, audit-heavy production AI platform, KServe + vLLM remains the canonical stack. For an opinionated, all-in-one LLM cluster manager that runs on whatever hardware you already have, GPUStack is the better fit.


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| Worker won't register | Token expired | gpustack token rotate, then re-add the worker |
| Model stuck in "Starting" | OOM at backend init | Lower replicas or use a smaller quant |
| GPU not detected | Driver / runtime missing | Install nvidia-container-toolkit / ROCm |
| Gateway 503 | All replicas unhealthy | Check worker logs in the UI |
| Slow first request | Container cold start | Pre-warm with periodic health pings (sketch below) |
| Mac worker disconnects | Network sleep | Disable sleep on the Mac worker |
| Ascend / DCU not detected | Vendor toolkit missing | Install the vendor SDK before starting the worker |
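
For the cold-start case, "pre-warm" just means sending a tiny request whenever a replica starts (or on a timer) so real users never pay the first-token penalty. A hedged sketch using the chat endpoint and model name from earlier; run it from cron or a systemd timer:

# Minimal warm-up request: one token, output discarded
curl -s http://<server>/v1/chat/completions \
    -H "Authorization: Bearer <api-key>" \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen2.5-7b","max_tokens":1,"messages":[{"role":"user","content":"ping"}]}' \
    > /dev/null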

FAQ {#faq}

See answers to common GPUStack questions below.


Sources: GPUStack GitHub | GPUStack docs | Internal benchmarks across NVIDIA, AMD, Apple, Ascend.
