GPUStack Setup Guide (2026): Open-Source GPU Cluster Manager for Local LLMs
GPUStack is the open-source GPU cluster manager that fills an awkward gap between "Ollama on one machine" and "full-blown KServe in Kubernetes." If you have 2-10 GPU machines — maybe a mix of NVIDIA workstations, an AMD box, a Mac Studio, and a few cloud instances — and you want to treat them as one LLM cluster with a unified OpenAI API and automatic model placement, GPUStack is the simplest way to do it.
This guide covers everything: the architecture (server + workers), installation across Linux / Mac / Windows / Docker / Kubernetes, GPU auto-discovery on heterogeneous hardware, model deployment with backend selection, replica placement, the OpenAI-compatible gateway, RBAC and API keys, monitoring, and tuning recipes for common cluster shapes.
Table of Contents
- What GPUStack Is
- Architecture: Server + Workers + Gateway
- Hardware Coverage
- Installation: Server Node
- Adding Worker Nodes
- Docker / Kubernetes Deployment
- Deploying Your First Model
- Backends: vLLM, llama.cpp, Ascend MindIE, Vox-box
- Replicas, Placement, and Failover
- The Unified OpenAI Gateway
- Authentication and RBAC
- Monitoring and Metrics
- Heterogeneous Cluster Examples
- GPUStack vs KServe vs Triton
- Troubleshooting
- FAQ
What GPUStack Is {#what-it-is}
GPUStack is a central control plane plus per-node worker agents that:
- Discover GPUs on every worker node (NVIDIA, AMD, Apple, Ascend, DCU, Intel)
- Schedule models onto suitable GPUs across nodes
- Run heterogeneous backends (vLLM, llama.cpp, ascend-mindie, vox-box) per model
- Expose a unified OpenAI-compatible API gateway
- Provide a web UI for cluster + model management
- Handle authentication, API keys, and per-key rate limits
Project: github.com/gpustack/gpustack. Apache 2.0 licensed. Developed by Seal.
Architecture: Server + Workers + Gateway {#architecture}
┌─────────────────┐
│ GPUStack UI │
│ (Browser) │
└────────┬────────┘
│
▼
┌─────────────────────────┐
│ GPUStack Server │
│ (Scheduler + Gateway) │
│ - PostgreSQL state │
│ - OpenAI API gateway │
└────┬─────────┬────┬─────┘
│ │ │
┌────────────┘ │ └────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Worker (NVIDIA) │ │ Worker (Mac) │ │ Worker (AMD) │
│ vLLM / llama.cpp│ │ llama.cpp │ │ vLLM / llama.cpp│
│ RTX 4090 x2 │ │ M4 Max 64GB │ │ RX 7900 XTX │
└─────────────────┘ └─────────────────┘ └─────────────────┘
The server runs the scheduler, the gateway, the web UI, and a small embedded database for state (SQLite by default; external PostgreSQL for production). Workers run on each GPU machine and launch the model backends.
Hardware Coverage {#hardware}
| Vendor | Backends | Notes |
|---|---|---|
| NVIDIA | vLLM, llama.cpp | CUDA 11.8+ |
| AMD | vLLM-rocm, llama.cpp HIP | ROCm 6.x; RX 7900-series, MI300X |
| Apple | llama.cpp Metal | M1+ |
| Huawei Ascend | mindie-llm | 910B / 310 / Atlas |
| Hygon DCU | dtk-llm | Z100 / K100 |
| Intel Arc | llama.cpp Vulkan / OpenVINO | Beta |
| CPU | llama.cpp | Fallback |
Ascend / DCU support is rare among open-source LLM cluster managers and a key reason GPUStack is popular for deployments in China and elsewhere in Asia, including government environments.
Installation: Server Node {#install-server}
# Linux / macOS
curl -sfL https://get.gpustack.ai | sh -
# Start the server (also acts as the first worker)
gpustack start
# Get the server's bootstrap token for adding workers
cat /var/lib/gpustack/token
Default ports: 80 (UI + API), 10150 (worker registration).
Browse to http://<server-ip>. Default login: admin / password from /var/lib/gpustack/initial_admin_password.
Windows
Download gpustack-installer.exe from GitHub releases and run. Same defaults.
Docker
docker run -d --gpus all \
--name gpustack-server \
--restart unless-stopped \
-p 80:80 \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack:latest
Adding Worker Nodes {#install-workers}
On each additional GPU machine:
curl -sfL https://get.gpustack.ai | sh -
gpustack start \
--server-url http://<server-ip> \
--token <bootstrap-token>
The worker registers, its GPUs are discovered, and it appears in the UI under Resources → Workers.
For Docker workers:
docker run -d --gpus all \
--name gpustack-worker \
--restart unless-stopped \
-e GPUSTACK_SERVER_URL=http://<server-ip> \
-e GPUSTACK_TOKEN=<token> \
gpustack/gpustack:latest worker
Docker / Kubernetes Deployment {#docker-k8s}
For Kubernetes:
helm repo add gpustack https://gpustack.github.io/helm-charts/
helm install gpustack gpustack/gpustack \
--namespace gpustack --create-namespace \
--set server.persistence.size=20Gi \
--set worker.daemonset.enabled=true
The chart deploys the server as a StatefulSet and workers as a DaemonSet on all nodes labeled gpustack.io/worker=true. Add the label to GPU nodes:
kubectl label node gpu-node-1 gpustack.io/worker=true
Deploying Your First Model {#first-model}
Via UI: Models → Deploy Model → enter Hugging Face ID (e.g., Qwen/Qwen2.5-7B-Instruct) → click Deploy.
Via API:
curl http://<server>/v1/models \
-H "Authorization: Bearer <api-key>" \
-H "Content-Type: application/json" \
-d '{
"name": "qwen2.5-7b",
"source": "huggingface",
"huggingface_repo_id": "Qwen/Qwen2.5-7B-Instruct",
"backend": "vllm",
"replicas": 2,
"gpu_count": 1
}'
GPUStack picks 2 GPUs across the cluster with enough VRAM, pulls the weights once per node, and starts vLLM containers.
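Once the deploy call returns, you can script a quick verification against the OpenAI-compatible endpoint. A minimal sketch using the openai Python package, reusing the <server> and <api-key> placeholders from above; it assumes the gateway exposes the standard GET /v1/models listing:
from openai import OpenAI

# Point a standard OpenAI client at the GPUStack gateway.
client = OpenAI(base_url="http://<server>/v1", api_key="<api-key>")

# The deployed model should appear once at least one replica is healthy.
for model in client.models.list():
    print(model.id)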
Backends: vLLM, llama.cpp, Ascend MindIE, Vox-box {#backends}
| Backend | When |
|---|---|
| vLLM | NVIDIA / AMD ROCm; high-throughput serving |
| llama.cpp | Mac, AMD Vulkan, CPU, GGUF format |
| mindie-llm | Huawei Ascend NPUs |
| dtk-llm | Hygon DCU |
| vox-box | Audio (Whisper, TTS) |
| diffusers | Image generation |
You can pin a backend per model or let GPUStack auto-select based on hardware availability. Mixed-backend deployments are common: same model name, vLLM replicas on NVIDIA, llama.cpp replicas on Mac.
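Pinning is a single field on the deploy payload. Below is a sketch that mirrors the earlier curl deploy, here forcing llama.cpp for a GGUF build; the repo ID and model name are illustrative, and the backend string follows the table above:
import requests

# Same deploy API as the curl example, with the backend pinned instead
# of auto-selected. Repo ID and model name are illustrative.
payload = {
    "name": "qwen2.5-7b-gguf",
    "source": "huggingface",
    "huggingface_repo_id": "Qwen/Qwen2.5-7B-Instruct-GGUF",
    "backend": "llama.cpp",
    "replicas": 1,
    "gpu_count": 1,
}
resp = requests.post(
    "http://<server>/v1/models",
    headers={"Authorization": "Bearer <api-key>"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())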
Replicas, Placement, and Failover {#replicas}
Set replicas: N on a model to deploy N independent copies. The scheduler:
- Filters workers by required hardware (gpu_count, gpu_vendor, VRAM).
- Sorts by current free VRAM and load.
- Places replicas to maximize fault tolerance (different nodes when possible).
- Health-checks replicas every 10s; replaces failed ones.
Failover: if a worker drops, the gateway routes new requests to remaining replicas. In-flight requests on the failed replica fail (no transparent retry).
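Since a dropped worker fails any requests in flight on it, clients should bring their own retry for idempotent calls. A minimal, non-GPUStack-specific sketch with the openai package:
import time
from openai import OpenAI, APIConnectionError, APIStatusError

client = OpenAI(base_url="http://<server>/v1", api_key="<api-key>")

def chat_with_retry(messages, retries=3, backoff=1.0):
    # Connection errors and 5xx are what a mid-request replica failure
    # surfaces as; 4xx client errors propagate unchanged.
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="qwen2.5-7b", messages=messages
            )
        except APIStatusError as err:
            if err.status_code < 500 or attempt == retries - 1:
                raise
        except APIConnectionError:
            if attempt == retries - 1:
                raise
        time.sleep(backoff * (2 ** attempt))  # exponential backoff

reply = chat_with_retry([{"role": "user", "content": "hi"}])
print(reply.choices[0].message.content)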
The Unified OpenAI Gateway {#gateway}
curl http://<server>/v1/chat/completions \
-H "Authorization: Bearer <api-key>" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-7b",
"messages": [{"role":"user","content":"hi"}]
}'
The gateway:
- Routes requests to a healthy replica with capacity
- Tracks per-replica request counts for load balancing
- Streams SSE / chunked responses through to the client
- Adds usage tracking per API key
- Honors per-key rate limits
OpenAI parity: chat completions, completions, embeddings, image generation (with diffusers backend), audio (with vox-box backend), reranker.
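Because the gateway speaks the standard OpenAI wire format, any OpenAI SDK works by swapping the base URL. The same call as the curl above, from Python with streaming enabled:
from openai import OpenAI

client = OpenAI(base_url="http://<server>/v1", api_key="<api-key>")

# stream=True exercises the SSE pass-through described above.
stream = client.chat.completions.create(
    model="qwen2.5-7b",
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()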
Authentication and RBAC {#auth}
UI: Settings → API Keys → Create. Set scope (model allow-list), rate limit, expiration.
CLI:
gpustack key create --name dev-team --scope "qwen2.5-7b,nomic-embed" \
--rate-limit-rpm 100 --expires 30d
Roles: admin (everything), user (deploy own models, use shared models). LDAP / OIDC integration is in tech preview.
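When a key exceeds its rate limit, the usual convention is an HTTP 429 response, possibly with a Retry-After header; treat that shape as an assumption rather than documented GPUStack behavior. A defensive client sketch:
import time
import requests

def post_with_backoff(url, api_key, payload):
    # Wait out per-key rate limits. 429 + Retry-After is conventional,
    # not confirmed GPUStack behavior.
    while True:
        r = requests.post(
            url,
            json=payload,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=60,
        )
        if r.status_code != 429:
            r.raise_for_status()
            return r.json()
        time.sleep(float(r.headers.get("Retry-After", "1")))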
Monitoring and Metrics {#monitoring}
Built-in dashboard shows per-GPU utilization, per-model TPS, per-API-key request counts. Prometheus scrape on /metrics:
| Metric | Meaning |
|---|---|
| gpustack_gpu_memory_used_bytes | Per-GPU VRAM used |
| gpustack_gpu_utilization_percent | GPU compute utilization |
| gpustack_model_requests_total | Per-model request count |
| gpustack_model_active_requests | In-flight requests per model |
| gpustack_api_key_requests_total | Per-key request count |
Pair with Grafana for dashboards.
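For a quick look without Grafana, the exposition format parses directly in Python. A sketch using requests and the prometheus_client parser, with a metric name taken from the table above:
import requests
from prometheus_client.parser import text_string_to_metric_families

# Scrape the server's Prometheus endpoint and print per-GPU VRAM use.
raw = requests.get("http://<server>/metrics", timeout=10).text
for family in text_string_to_metric_families(raw):
    if family.name == "gpustack_gpu_memory_used_bytes":
        for sample in family.samples:
            print(sample.labels, sample.value)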
Heterogeneous Cluster Examples {#heterogeneous}
Example 1: Solo developer with workstation + Mac
- Server + worker on RTX 4090 desktop (Linux)
- Worker on M4 Max MacBook Pro (Mac)
- Models: vLLM Qwen 2.5 7B on the 4090 (high throughput), llama.cpp Llama 3.3 70B on the Mac (large model)
Example 2: Small team
- 1 server + 4 workers in a homelab
- 2x RTX 4090 (NVIDIA workers)
- 1x RX 7900 XTX (AMD worker)
- 1x Mac Studio M4 Max 128GB (Mac worker)
- Models: 70B on the Mac, 32B AWQ replicated across 4090s, 8B on the 7900 XTX for embedding workloads
Example 3: Air-gapped enterprise cluster
- 1 server + 8 workers on 8x H100 nodes
- vLLM Llama 3.1 405B FP8 with TP=8, replicated 2x for HA
- vLLM Qwen 2.5 7B AWQ on smaller A6000 nodes
- llama.cpp Whisper for transcription
- All with Ascend / DCU fallback nodes for sovereignty requirements
GPUStack vs KServe vs Triton {#vs-kserve}
| Property | GPUStack | KServe | Triton |
|---|---|---|---|
| Deployment model | Standalone or K8s | K8s only | Standalone or K8s |
| LLM-specific | Yes | No (general) | Partial |
| Heterogeneous hardware | Yes | No | Limited |
| Backends | vLLM, llama.cpp, mindie, dtk | Any (custom) | TF, PyTorch, TRT-LLM, ONNX |
| OpenAI gateway built-in | Yes | No (custom) | Via TGI |
| Web UI | Yes | No (Knative dashboard) | No |
| Multi-tenant API keys | Yes | Via Istio | Via Triton+gateway |
| Production maturity | Growing | Mature | Mature |
For a Kubernetes-native, audit-heavy production AI platform, KServe + vLLM is the canonical stack. For an opinionated all-in-one LLM cluster manager that runs on whatever hardware you have, GPUStack is the better fit.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Worker won't register | Token expired | gpustack token rotate and re-add |
| Model stuck in "Starting" | OOM at backend init | Lower replicas or use smaller quant |
| GPU not detected | Driver / runtime missing | Install nvidia-container-toolkit / rocm |
| Gateway 503 | All replicas unhealthy | Check worker logs in UI |
| Slow first request | Container cold start | Pre-warm with periodic health pings (see the sketch after this table) |
| Mac worker disconnects | Network sleep | Disable sleep on Mac worker |
| Ascend / DCU not detected | Vendor toolkit missing | Install the vendor SDK before worker start |
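For the cold-start row, a keep-warm loop is enough: send a one-token request to each model on a timer. A sketch (model list and interval are illustrative):
import time
from openai import OpenAI

client = OpenAI(base_url="http://<server>/v1", api_key="<api-key>")
MODELS = ["qwen2.5-7b"]  # whichever models need warm starts

while True:
    for name in MODELS:
        try:
            client.chat.completions.create(
                model=name,
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
            )
        except Exception as exc:  # keep the loop alive on transient errors
            print(f"warm ping failed for {name}: {exc}")
    time.sleep(300)  # every 5 minutes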
FAQ {#faq}
See answers to common GPUStack questions below.
Sources: GPUStack GitHub | GPUStack docs | Internal benchmarks across NVIDIA, AMD, Apple, Ascend.