vLLM Complete Setup Guide for Local LLMs (2026): Install, Tune, Serve
vLLM delivers 5-20x higher aggregate throughput than Ollama on multi-user workloads — and with the right configuration, lower per-token latency too. This guide covers everything: installation, model loading, quantization, tensor parallelism, OpenAI-compatible serving, prefix caching, chunked prefill, Docker, Kubernetes, observability, and tuning playbooks for the most common GPUs.
If you serve more than one concurrent user, vLLM is the upgrade that pays for itself the same day you deploy it.
Table of Contents
- What vLLM Is, In One Page
- vLLM vs Ollama vs llama.cpp vs TensorRT-LLM
- System Requirements
- Installation
- Your First Model in 60 Seconds
- OpenAI-Compatible API
- Quantization: AWQ, GPTQ, FP8, INT8
- PagedAttention & KV-Cache Tuning
- Continuous Batching & Scheduling
- Tensor Parallelism & Multi-GPU
- Prefix Caching & Chunked Prefill
- Speculative Decoding
- Tool Calling & Structured Outputs
- Docker Deployment
- Kubernetes & KServe
- Observability: Prometheus, Tracing, Logging
- Authentication, Rate Limiting, Multi-Tenancy
- Benchmarking Your Deployment
- Tuning Recipes by GPU
- Common Errors & Fixes
- FAQ
What vLLM Is, In One Page {#what-vllm-is}
vLLM is an open-source inference and serving engine for LLMs, originally built at UC Berkeley (PagedAttention paper, 2023). It is written in Python with custom CUDA kernels and is now the de facto standard for self-hosted high-throughput LLM serving.
Core innovations:
- PagedAttention — KV cache stored in fixed-size blocks, near-zero fragmentation.
- Continuous batching — schedules at the iteration level, no idle GPU.
- Optimized kernels — FlashAttention 2/3, FP8, AWQ-INT4 dequant fused into matmul.
- OpenAI-compatible API — drop-in for any client expecting OpenAI endpoints.
- Distributed inference — tensor parallel + pipeline parallel + Ray-based multi-node.
What it is not: a desktop app, a fine-tuning tool, or a model converter. It is a server. Pair it with LiteLLM for routing, Open WebUI for chat UI, and Langfuse for tracing.
vLLM vs Ollama vs llama.cpp vs TensorRT-LLM {#comparison}
| Feature | Ollama | llama.cpp | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|---|---|
| Single-user latency | Excellent | Excellent | Good | Best | Excellent |
| Aggregate throughput (32 users) | Poor | Poor | Excellent | Excellent | Excellent |
| Multi-user concurrency | 1-4 | 1-4 | 256+ | 128+ | 256+ |
| OpenAI API | ✅ | ✅ | ✅ | via Triton | ✅ |
| Quantization formats | GGUF | GGUF | AWQ, GPTQ, FP8, INT8, GGUF | INT4-AWQ, FP8, INT8 | AWQ, FP8 |
| Prefix caching | ❌ | partial | ✅ | ✅ | ✅ (best) |
| Tensor parallel | ❌ | partial | ✅ | ✅ | ✅ |
| Pipeline parallel | basic | basic | ✅ | ✅ | ✅ |
| FP8 (Ada/Hopper) | ❌ | ❌ | ✅ | ✅ | ✅ |
| Setup time | 60s | 5min | 5min | 30-60min | 5min |
| Best for | Desktop | Desktop / edge | Production servers | Lowest latency | Agent frameworks |
Decision rule: if more than 4 users hit your endpoint at once, switch to vLLM. If you need the lowest single-stream latency for a critical path, use TensorRT-LLM. Otherwise stay on Ollama for simplicity.
System Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| GPU | Compute capability 7.0+ (V100/T4/RTX 20-series) | Ampere or newer (RTX 30/40/50, A100, H100) |
| VRAM (per model size) | 8B BF16: 18GB / 8B AWQ: 8GB / 70B AWQ: 40GB | +30% headroom for KV cache |
| Driver | 535+ | 550+ (FP8 needs Ada/Hopper) |
| CUDA Toolkit | 12.1 | 12.4+ (build only; runtime uses driver) |
| Python | 3.9 | 3.10-3.12 |
| RAM | 32GB | 64GB+ for 70B models |
| Disk | 100GB free | NVMe; models can be 200GB+ in BF16 |
| OS | Linux (Ubuntu 22.04+) | Ubuntu 22.04 LTS |
Windows native is not officially supported; use WSL2. macOS is not supported (use llama.cpp / Ollama or MLX).
Installation {#installation}
Option 1: pip (most common)
# Fresh venv, always
python3.11 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install --upgrade pip
# vLLM with CUDA 12.4 wheels
pip install vllm
# Verify
python -c "import vllm; print(vllm.__version__)"
Option 2: Docker (recommended for production)
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct
--ipc=host is required because vLLM's worker processes exchange tensors through host shared memory; without it, multi-GPU runs fail with cryptic IPC errors.
Option 3: from source (custom CUDA, AMD ROCm, custom kernels)
git clone https://github.com/vllm-project/vllm.git
cd vllm
export TORCH_CUDA_ARCH_LIST="8.0 8.6 8.9 9.0 12.0" # Ampere, Ada, Hopper, Blackwell
pip install -e .
Building from source takes 10-30 minutes depending on CPU core count and how many target architectures you compile for. Set MAX_JOBS=8 to cap parallel compile jobs and keep RAM usage in check.
Option 4: AMD ROCm (experimental)
docker pull rocm/vllm:latest
docker run --device /dev/kfd --device /dev/dri --group-add video \
-p 8000:8000 \
rocm/vllm:latest \
--model meta-llama/Llama-3.1-8B-Instruct
ROCm support targets MI300X, MI250X, and Radeon RX 7900 XTX. Performance is 60-80% of CUDA equivalents. See our AMD ROCm Setup Guide for details.
Your First Model in 60 Seconds {#first-model}
# Authenticate with Hugging Face if the model is gated
huggingface-cli login
# Start server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
In another terminal:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
"max_tokens": 200,
"temperature": 0.7
}'
That's the entire baseline setup. Everything else in this guide is tuning.
OpenAI-Compatible API {#openai-api}
vLLM implements the OpenAI REST API spec. Any OpenAI client works unchanged with a different base_url.
Python (official OpenAI SDK)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
resp = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello"}],
stream=True,
)
for chunk in resp:
print(chunk.choices[0].delta.content or "", end="", flush=True)
Endpoints
| Endpoint | Purpose |
|---|---|
| POST /v1/chat/completions | Chat with messages (most common) |
| POST /v1/completions | Legacy completions API |
| POST /v1/embeddings | Vector embeddings (with embedding models) |
| GET /v1/models | List loaded models |
| GET /health | Liveness probe |
| GET /metrics | Prometheus metrics |
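A quick smoke test against the liveness and model-list endpoints, using the requests library (a sketch; adjust the base URL to wherever your server listens):
import requests

BASE = "http://localhost:8000"

# Liveness probe: returns 200 with an empty body once the engine is up
print(requests.get(f"{BASE}/health").status_code)

# List the models the server has loaded
for m in requests.get(f"{BASE}/v1/models").json()["data"]:
    print(m["id"])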
Streaming, tool calls, structured outputs
All work. Set stream: true for SSE, pass tools for function calling, and response_format: { type: "json_schema", json_schema: {...} } for guaranteed-valid JSON output.
Quantization: AWQ, GPTQ, FP8, INT8 {#quantization}
Quantization is the biggest single performance lever after fitting in VRAM. See our CUDA optimization guide and AWQ vs GPTQ vs GGUF for the underlying theory; here are the vLLM-specific recipes.
AWQ-INT4 (best general purpose)
vllm serve casperhansen/llama-3.1-8b-instruct-awq \
--quantization awq \
--max-model-len 16384
Quality is typically within 1% of BF16; size is ~4x smaller. Available for almost every popular model on Hugging Face.
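The same engine is also usable as a Python library for offline batch jobs instead of serving. A minimal sketch with the AWQ checkpoint above (assumes the model fits on your GPU and your Hugging Face login can access it):
from vllm import LLM, SamplingParams

# Offline batch inference with the same AWQ checkpoint
llm = LLM(model="casperhansen/llama-3.1-8b-instruct-awq",
          quantization="awq", max_model_len=16384)
params = SamplingParams(temperature=0.7, max_tokens=200)

outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)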
FP8 (Ada / Blackwell / Hopper only)
vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--max-model-len 32768 \
--tensor-parallel-size 2
FP8 uses native Tensor Core support and roughly doubles throughput vs BF16 on RTX 4090 / 5090 / H100. Requires a model published in FP8 format; NeuralMagic and Meta publish FP8 versions of popular Llama models.
GPTQ-INT4
vllm serve TheBloke/Llama-2-13B-chat-GPTQ \
--quantization gptq \
--max-model-len 4096
Older but still common. Group size 128 is standard.
W8A8 INT8 (SmoothQuant / RTN)
vllm serve neuralmagic/Llama-3.1-8B-Instruct-quantized.w8a8 \
--quantization compressed-tensors
Slightly higher quality than INT4 at 2x the size; useful when you have plenty of VRAM and want maximum throughput.
Quantization decision tree
- Ada/Hopper/Blackwell GPU and an FP8 checkpoint available? → FP8.
- Need broad model coverage? → AWQ-INT4.
- Plenty of VRAM, max quality? → BF16 (no quantization).
- Tight VRAM budget on Ampere? → AWQ-INT4 or GPTQ-INT4.
PagedAttention & KV-Cache Tuning {#paged-attention}
PagedAttention is automatic — you do not configure it directly. But you do control how much memory it gets.
Block size
--block-size 16 # default, best for most workloads
--block-size 32 # better for long contexts (>32K)
Larger blocks reduce metadata overhead but increase fragmentation. 16 is a good default; bump to 32 if you serve only long-context requests.
GPU memory utilization
--gpu-memory-utilization 0.92
vLLM allocates this fraction of total VRAM for model weights + KV cache. Default is 0.90. Push to 0.95-0.97 on dedicated inference boxes; keep 0.85-0.90 if you also run other GPU workloads.
KV-cache dtype
--kv-cache-dtype fp8_e4m3 # Ada+ only — recommended for long context
--kv-cache-dtype fp8_e5m2 # Better range, slightly worse precision
--kv-cache-dtype auto # Match model dtype
FP8 KV cache halves cache memory vs BF16. On Llama 3.1 70B at 32K context, this frees roughly 10GB of VRAM for additional concurrent requests.
Max model length
--max-model-len 32768
vLLM pre-allocates KV-cache space for the longest possible sequence. Setting this lower than the model's native context length directly increases concurrency. If you only need 4K context, set --max-model-len 4096 and you can fit 4-8x more concurrent requests.
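A back-of-the-envelope calculation shows why. This sketch uses Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dim 128) and assumes an FP16 KV cache:
# Worst-case KV-cache footprint the scheduler must budget per sequence
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
print(per_token / 1024, "KiB per token")                        # ~128 KiB

for max_len in (4096, 32768):
    gib = per_token * max_len / 1024**3
    print(f"max-model-len {max_len}: ~{gib:.1f} GiB per sequence")
# 4096 tokens reserves ~0.5 GiB per sequence vs ~4 GiB at 32768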
Continuous Batching & Scheduling {#continuous-batching}
The scheduler decides which requests run at each step. Two key parameters:
--max-num-seqs 256 # max concurrent sequences
--max-num-batched-tokens 8192 # max tokens computed per step
max-num-seqs caps concurrency; max-num-batched-tokens caps the per-step compute. The right values depend on workload:
| Workload | max-num-seqs | max-num-batched-tokens |
|---|---|---|
| Chat (short prompts, short replies) | 256 | 4096 |
| RAG (long prompts, short replies) | 64 | 16384 |
| Code completion (medium both) | 128 | 8192 |
| Long-form generation | 32 | 4096 |
Increase max-num-batched-tokens to favor throughput over latency. Decrease max-num-seqs if individual requests starve.
Priority scheduling
--scheduling-policy priority
With this set, requests with higher priority field jump the queue. Useful for interactive vs batch traffic on the same endpoint.
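From the client side, the priority rides along with a normal request. A sketch using the OpenAI SDK, assuming your vLLM version accepts a top-level priority field (the exact field name and whether lower or higher values win should be checked against your version's docs):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Interactive query"}],
    # extra_body forwards fields the OpenAI SDK does not know about
    extra_body={"priority": 0},  # assumed field; verify against your vLLM version
)
print(resp.choices[0].message.content)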
Tensor Parallelism & Multi-GPU {#tensor-parallel}
For models that do not fit on a single GPU, split the matmuls.
# 2x RTX 4090 — Llama 3.1 70B AWQ
vllm serve casperhansen/llama-3.1-70b-instruct-awq \
--quantization awq \
--tensor-parallel-size 2 \
--max-model-len 16384 \
--gpu-memory-utilization 0.92
Pipeline parallel (multi-node)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--distributed-executor-backend ray
This splits a 405B model across 16 GPUs (2 nodes of 8 H100s) using TP within node and PP across nodes. Requires Ray cluster: ray start --head on node 1, ray start --address=... on node 2.
NCCL tuning for multi-GPU
export NCCL_P2P_LEVEL=NVL
export NCCL_IB_DISABLE=1 # workstations only; set to 0 (or remove) on InfiniBand clusters
export NCCL_DEBUG=WARN
See CUDA Optimization for the full multi-GPU theory.
Prefix Caching & Chunked Prefill {#prefix-caching}
Prefix caching
--enable-prefix-caching
Caches KV state for repeated prompt prefixes (system prompts, few-shot examples, RAG retrievals). For agent workloads with long system prompts, this delivers 10-100x lower time-to-first-token on cached prefixes. Free win — always enable in production.
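A simple way to see the effect is to send the same long system prompt twice and time the first streamed token. A sketch (the prompt is synthetic and absolute numbers depend on your GPU):
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
system = "You are a support agent. " + "Policy paragraph. " * 400  # long shared prefix

def ttft(question):
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
        max_tokens=32, stream=True)
    for chunk in stream:
        if chunk.choices[0].delta.content:
            return time.perf_counter() - start

print("cold prefix:", ttft("What is the refund window?"))
print("warm prefix:", ttft("How do I reset my password?"))  # prefix KV reused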
Chunked prefill
--enable-chunked-prefill
--max-num-batched-tokens 8192
Splits long prompt prefill into chunks that interleave with decode steps from other requests. A 32K-token prompt no longer blocks the queue for 5 seconds; it shares the GPU with shorter requests. Critical for mixed RAG + chat workloads.
Disk-based KV cache (LMCache)
For very large prefix-cache working sets:
pip install lmcache
vllm serve <model> \
--enable-prefix-caching \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1"}'
LMCache spills cold prefix-cache entries to CPU RAM and disk. Useful for retrieval-heavy workloads where the same passages get reused across many users.
Speculative Decoding {#speculative}
Draft tokens with a small model, verify with the big one. See CUDA Optimization for the theory.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5
Pair models with identical tokenizers. Llama 3.1 70B + Llama 3.2 1B is the canonical pair (same vocab). Expected speedup: 1.5-2.5x at single-user batch size 1; less under high concurrency.
EAGLE / EAGLE-2
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative-model yuhuili/EAGLE-LLaMA3.1-Instruct-8B \
--speculative-draft-tensor-parallel-size 1
EAGLE is faster than vanilla speculative decoding (2.5-3.5x) because it shares the target model's hidden states.
N-gram (no draft model)
--speculative-model "[ngram]" --ngram-prompt-lookup-max 4
Looks for repeated n-grams from the prompt itself. Excellent for code generation and RAG where output reuses prompt content. Free speedup with zero extra memory.
Tool Calling & Structured Outputs {#tool-calling}
Tool / function calling
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
}]
resp = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Weather in Tokyo?"}],
tools=tools,
)
vLLM supports tool calling for Llama 3.1+, Hermes, Mistral, and Qwen 2.5+. Pass --enable-auto-tool-choice --tool-call-parser <model_family> to enable parsing.
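The tool call comes back on the message object; your code runs the function and returns the result in a follow-up tool message. A minimal continuation of the example above (the weather lookup result is hypothetical):
import json

call = resp.choices[0].message.tool_calls[0]        # assumes the model chose to call a tool
args = json.loads(call.function.arguments)          # e.g. {"city": "Tokyo"}
result = {"city": args["city"], "temp_c": 21}       # stand-in for a real lookup

followup = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Weather in Tokyo?"},
        resp.choices[0].message,                     # assistant turn containing the tool call
        {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
    ],
    tools=tools,
)
print(followup.choices[0].message.content)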
Structured outputs (JSON Schema)
resp = client.chat.completions.create(
model="...",
messages=[...],
response_format={
"type": "json_schema",
"json_schema": {"name": "person", "schema": {...}},
},
)
vLLM uses xgrammar (default) or outlines for constrained generation. Output is guaranteed schema-valid. Throughput overhead is 5-15%.
--guided-decoding-backend xgrammar # default, fastest
--guided-decoding-backend outlines # alternative, sometimes more compatible
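A complete round trip with a concrete schema (the person schema here is illustrative):
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Extract: Ada Lovelace died at age 36."}],
    response_format={"type": "json_schema",
                     "json_schema": {"name": "person", "schema": schema}},
)
person = json.loads(resp.choices[0].message.content)  # guaranteed to match the schema
print(person["name"], person["age"])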
Docker Deployment {#docker}
Single-container (simple)
docker run -d --name vllm \
--restart unless-stopped \
--runtime nvidia --gpus all \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN=$HF_TOKEN \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 16384 \
--enable-prefix-caching \
--gpu-memory-utilization 0.92
docker-compose.yml
services:
vllm:
image: vllm/vllm-openai:latest
runtime: nvidia
ipc: host
restart: unless-stopped
environment:
HF_TOKEN: ${HF_TOKEN}
volumes:
- hf-cache:/root/.cache/huggingface
ports:
- "8000:8000"
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--max-model-len 16384
--enable-prefix-caching
--gpu-memory-utilization 0.92
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
volumes:
hf-cache:
Pre-pulling models
Models cached in ~/.cache/huggingface survive container restarts. Pre-download once on the host:
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct
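The same pre-download can be scripted with huggingface_hub, which the CLI wraps (a sketch; the ignore pattern skips the non-safetensors originals shipped with Llama repos):
from huggingface_hub import snapshot_download

# Downloads into ~/.cache/huggingface by default, so the vLLM container finds it
snapshot_download("meta-llama/Llama-3.1-8B-Instruct",
                  ignore_patterns=["*.pth", "original/*"])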
Kubernetes & KServe {#kubernetes}
Bare Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-llama-8b
spec:
replicas: 1
selector:
matchLabels: { app: vllm-llama-8b }
template:
metadata:
labels: { app: vllm-llama-8b }
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- --model
- meta-llama/Llama-3.1-8B-Instruct
- --max-model-len
- "16384"
- --enable-prefix-caching
- --gpu-memory-utilization
- "0.92"
ports:
- { name: http, containerPort: 8000 }
env:
- name: HF_TOKEN
valueFrom: { secretKeyRef: { name: hf-token, key: token } }
resources:
limits: { nvidia.com/gpu: "1" }
readinessProbe:
httpGet: { path: /health, port: http }
initialDelaySeconds: 60
periodSeconds: 10
volumeMounts:
- { name: hf-cache, mountPath: /root/.cache/huggingface }
- { name: shm, mountPath: /dev/shm }
volumes:
- name: hf-cache
persistentVolumeClaim: { claimName: hf-cache-pvc }
- name: shm
emptyDir: { medium: Memory, sizeLimit: 16Gi }
The /dev/shm volume is critical — without it, multi-worker setups fail with cryptic CUDA IPC errors.
KServe InferenceService
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-3-1-8b
spec:
predictor:
model:
modelFormat: { name: vllm }
args:
- --max-model-len=16384
- --enable-prefix-caching
runtime: kserve-vllmserver
storageUri: hf://meta-llama/Llama-3.1-8B-Instruct
resources:
limits: { nvidia.com/gpu: "1" }
KServe handles autoscaling (KEDA / native), traffic splitting for canary deploys, and GitOps integration.
Horizontal Pod Autoscaler
vLLM exposes vllm:num_requests_running and vllm:num_requests_waiting via Prometheus. Scale on queue depth:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: vllm-hpa }
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-llama-8b
minReplicas: 1
maxReplicas: 10
metrics:
- type: Pods
pods:
metric: { name: vllm_num_requests_waiting }
target: { type: AverageValue, averageValue: "5" }
Observability: Prometheus, Tracing, Logging {#observability}
Prometheus metrics
vLLM exposes /metrics natively. Scrape every 15s.
Key metrics:
| Metric | Meaning |
|---|---|
| vllm:num_requests_running | Currently decoding |
| vllm:num_requests_waiting | Queued |
| vllm:gpu_cache_usage_perc | KV-cache utilization |
| vllm:time_to_first_token_seconds | TTFT histogram |
| vllm:time_per_output_token_seconds | Per-token latency |
| vllm:e2e_request_latency_seconds | Total request latency |
| vllm:prompt_tokens_total | Aggregate prompt tokens |
| vllm:generation_tokens_total | Aggregate generated tokens |
Grafana dashboard
Import dashboard ID 19655 (community vLLM dashboard). Key panels: TTFT p50/p95/p99, throughput tok/s, KV-cache utilization, queue depth.
OpenTelemetry tracing
vllm serve <model> --otlp-traces-endpoint http://otel-collector:4317
Traces include per-request prefill/decode timings, scheduling delays, and inter-token gaps.
Structured logging
export VLLM_LOGGING_CONFIG_PATH=./logging.json
Configure JSON logs for log aggregators (Loki, Elasticsearch, Datadog).
Authentication, Rate Limiting, Multi-Tenancy {#auth}
vLLM does not have built-in auth — put it behind a gateway.
Pattern: LiteLLM proxy
# litellm_config.yaml
model_list:
- model_name: llama-3-1-8b
litellm_params:
model: openai/meta-llama/Llama-3.1-8B-Instruct
api_base: http://vllm:8000/v1
api_key: dummy
litellm_settings:
master_key: sk-litellm-master
database_url: postgresql://...
general_settings:
enforce_user_param: true
budget_duration: 30d
LiteLLM handles per-key rate limits, budget tracking, fallbacks, and request routing across multiple vLLM replicas.
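Clients then talk to the proxy instead of vLLM directly, using a virtual key that LiteLLM issues. A sketch (port 4000 is LiteLLM's default; the key shown is a placeholder):
from openai import OpenAI

# Point clients at the LiteLLM proxy, not at vLLM itself
client = OpenAI(base_url="http://litellm:4000/v1", api_key="sk-team-a-virtual-key")
resp = client.chat.completions.create(
    model="llama-3-1-8b",   # the model_name from litellm_config.yaml
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)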
Pattern: Envoy / Istio
For Kubernetes-native auth, use Istio AuthorizationPolicy or Envoy filters with API key headers. Combine with Ollama rate limiting patterns adapted to vLLM.
Benchmarking Your Deployment {#benchmarking}
Built-in benchmark scripts
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
python benchmark_serving.py \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 1000 \
--request-rate 10
Outputs:
- Throughput (req/s and tok/s)
- TTFT p50/p95/p99
- ITL (inter-token latency) p50/p95/p99
- Acceptance rate (with speculative decoding)
Real benchmarks
RTX 4090 (24GB), Llama 3.1 8B AWQ, FP8 KV cache:
| Concurrency | TTFT p50 | TTFT p99 | Tok/s aggregate | GPU util |
|---|---|---|---|---|
| 1 | 35 ms | 55 ms | 142 | 78% |
| 4 | 48 ms | 110 ms | 480 | 95% |
| 16 | 95 ms | 280 ms | 1,150 | 99% |
| 32 | 170 ms | 510 ms | 1,720 | 99% |
| 64 | 320 ms | 1,100 ms | 2,200 | 99% |
Tradeoff: 64 concurrent requests deliver roughly 15x the aggregate throughput of a single request, but tail latency degrades. Set HPA targets to keep p95 TTFT under 500 ms.
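If you want a quick concurrency sweep without the full benchmark harness, a small asyncio script gives rough TTFT numbers. A sketch (the model name must match whatever your server loaded; treat results as directional, not comparable to benchmark_serving.py):
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="x")

async def one_request():
    start = time.perf_counter()
    stream = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        max_tokens=64, stream=True)
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            return time.perf_counter() - start

async def main(concurrency=16):
    ttfts = await asyncio.gather(*[one_request() for _ in range(concurrency)])
    ttfts = sorted(t for t in ttfts if t is not None)
    print(f"n={len(ttfts)}  p50={ttfts[len(ttfts)//2]*1000:.0f} ms  "
          f"p95={ttfts[int(len(ttfts)*0.95)]*1000:.0f} ms")

asyncio.run(main())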
Tuning Recipes by GPU {#tuning-recipes}
RTX 3090 / 3090 Ti (24GB Ampere)
vllm serve casperhansen/llama-3.1-8b-instruct-awq \
--quantization awq \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--max-num-seqs 256
RTX 4090 (24GB Ada)
vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--max-model-len 32768 \
--gpu-memory-utilization 0.93 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 16384
RTX 5090 (32GB Blackwell)
vllm serve casperhansen/llama-3.1-70b-instruct-awq \
--quantization awq \
--kv-cache-dtype fp8_e4m3 \
--max-model-len 32768 \
--gpu-memory-utilization 0.94 \
--enable-prefix-caching \
--enable-chunked-prefill
2x RTX 4090 (48GB total, PCIe)
vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching
8x H100 (640GB total)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
--quantization fp8 \
--tensor-parallel-size 8 \
--max-model-len 65536 \
--gpu-memory-utilization 0.94 \
--enable-prefix-caching \
--enable-chunked-prefill
Common Errors & Fixes {#troubleshooting}
| Error | Cause | Fix |
|---|---|---|
| CUDA out of memory at startup | Model + KV cache exceeds VRAM | Lower --max-model-len, use AWQ/FP8, or reduce --gpu-memory-utilization |
| CUDA out of memory mid-request | Too many concurrent sequences | Lower --max-num-seqs or --max-num-batched-tokens |
| AssertionError: Cannot find model in HF cache | Not authenticated | huggingface-cli login or set HF_TOKEN |
| RuntimeError: NCCL error | TP across GPUs without proper IPC | Add --ipc=host to docker run; mount /dev/shm in K8s |
| Speculative decoding accept rate < 50% | Mismatched draft/target tokenizer | Use same model family for draft and target |
| TTFT > 5s on first request | Cold start, kernel autotune | Warm up with 5-10 dummy requests after start |
| Throughput drops after hours of uptime | KV-cache fragmentation (rare) | Restart; tune --block-size 32 for long-context workloads |
| Pip install fails with torch conflict | System torch version mismatch | Fresh venv; never install over system torch |
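For the cold-start row above, a small warm-up loop run after the server reports healthy is usually enough. A sketch:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
# A handful of throwaway requests to trigger kernel autotuning and graph capture
for i in range(8):
    client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": f"Warm-up request {i}"}],
        max_tokens=16,
    )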
FAQ {#faq}
See answers to common vLLM questions below.
Sources: vLLM documentation | PagedAttention paper (Kwon et al., 2023) | vLLM GitHub | NeuralMagic FP8 model collection | Internal benchmarks on RTX 3090, 4090, 5090, A100, H100.