Production

vLLM Complete Setup Guide for Local LLMs (2026): Install, Tune, Serve

May 1, 2026
32 min read
LocalAimaster Research Team


vLLM delivers 5-20x higher aggregate throughput than Ollama on multi-user workloads — and with the right configuration, lower per-token latency too. This guide covers everything: installation, model loading, quantization, tensor parallelism, OpenAI-compatible serving, prefix caching, chunked prefill, Docker, Kubernetes, observability, and tuning playbooks for the most common GPUs.

If you serve more than one concurrent user, vLLM is the upgrade that pays for itself the same day you deploy it.

Table of Contents

  1. What vLLM Is, In One Page
  2. vLLM vs Ollama vs llama.cpp vs TensorRT-LLM
  3. System Requirements
  4. Installation
  5. Your First Model in 60 Seconds
  6. OpenAI-Compatible API
  7. Quantization: AWQ, GPTQ, FP8, INT8
  8. PagedAttention & KV-Cache Tuning
  9. Continuous Batching & Scheduling
  10. Tensor Parallelism & Multi-GPU
  11. Prefix Caching & Chunked Prefill
  12. Speculative Decoding
  13. Tool Calling & Structured Outputs
  14. Docker Deployment
  15. Kubernetes & KServe
  16. Observability: Prometheus, Tracing, Logging
  17. Authentication, Rate Limiting, Multi-Tenancy
  18. Benchmarking Your Deployment
  19. Tuning Recipes by GPU
  20. Common Errors & Fixes
  21. FAQ

Reading articles is good. Building is better.

Free account = 17+ structured chapters across 17 courses, with a per-chapter AI tutor. No card. Cancel anytime if you ever upgrade.

What vLLM Is, In One Page {#what-vllm-is}

vLLM is an open-source inference and serving engine for LLMs, originally built at UC Berkeley (PagedAttention paper, 2023). It is written in Python with custom CUDA kernels and is now the de facto standard for self-hosted high-throughput LLM serving.

Core innovations:

  • PagedAttention — KV cache stored in fixed-size blocks, near-zero fragmentation.
  • Continuous batching — schedules at the iteration level, no idle GPU.
  • Optimized kernels — FlashAttention 2/3, FP8, AWQ-INT4 dequant fused into matmul.
  • OpenAI-compatible API — drop-in for any client expecting OpenAI endpoints.
  • Distributed inference — tensor parallel + pipeline parallel + Ray-based multi-node.

What it is not: a desktop app, a fine-tuning tool, or a model converter. It is a server. Pair it with LiteLLM for routing, Open WebUI for chat UI, and Langfuse for tracing.


vLLM vs Ollama vs llama.cpp vs TensorRT-LLM {#comparison}

| Feature | Ollama | llama.cpp | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|---|---|
| Single-user latency | Excellent | Excellent | Good | Best | Excellent |
| Aggregate throughput (32 users) | Poor | Poor | Excellent | Excellent | Excellent |
| Multi-user concurrency | 1-4 | 1-4 | 256+ | 128+ | 256+ |
| OpenAI API | ✅ | ✅ | ✅ | via Triton | ✅ |
| Quantization formats | GGUF | GGUF | AWQ, GPTQ, FP8, INT8, GGUF | INT4-AWQ, FP8, INT8 | AWQ, FP8 |
| Prefix caching | ❌ | ❌ | ✅ | partial | ✅ (best) |
| Tensor parallel | ❌ | partial | ✅ | ✅ | ✅ |
| Pipeline parallel | ❌ | basic | ✅ | ✅ | basic |
| FP8 (Ada/Hopper) | ❌ | ❌ | ✅ | ✅ | ✅ |
| Setup time | 60s | 5min | 5min | 30-60min | 5min |
| Best for | Desktop | Desktop / edge | Production servers | Lowest latency | Agent frameworks |

Decision rule: if more than 4 users hit your endpoint at once, switch to vLLM. If you need the lowest single-stream latency for a critical path, use TensorRT-LLM. Otherwise stay on Ollama for simplicity.
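The decision rule above can be sketched as a tiny helper (the function name and thresholds are illustrative, not from any library):

```python
def choose_engine(concurrent_users: int, latency_critical: bool = False) -> str:
    """Pick a serving engine using the decision rule above.

    concurrent_users: peak simultaneous requests you expect.
    latency_critical: True if single-stream latency dominates your use case.
    """
    if latency_critical:
        return "TensorRT-LLM"   # lowest single-stream latency, longest setup
    if concurrent_users > 4:
        return "vLLM"           # continuous batching pays off past ~4 users
    return "Ollama"             # simplest option for desktop / single-user use
```

For a 32-user endpoint, `choose_engine(32)` returns `"vLLM"`, matching the rule.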


System Requirements {#requirements}

| Component | Minimum | Recommended |
|---|---|---|
| GPU | Compute capability 7.0+ (V100/T4/RTX 20-series) | Ampere or newer (RTX 30/40/50, A100, H100) |
| VRAM (per model size) | 8B BF16: 18GB / 8B AWQ: 8GB / 70B AWQ: 40GB | +30% headroom for KV cache |
| Driver | 535+ | 550+ (FP8 needs Ada/Hopper) |
| CUDA Toolkit | 12.1 | 12.4+ (build only; runtime uses driver) |
| Python | 3.9 | 3.10-3.12 |
| RAM | 32GB | 64GB+ for 70B models |
| Disk | 100GB free | NVMe; models can be 200GB+ in BF16 |
| OS | Linux (Ubuntu 22.04+) | Ubuntu 22.04 LTS |

Windows native is not officially supported; use WSL2. macOS is not supported (use llama.cpp / Ollama or MLX).



Installation {#installation}

Option 1: pip (most common)

# Fresh venv, always
python3.11 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install --upgrade pip

# vLLM with CUDA 12.4 wheels
pip install vllm

# Verify
python -c "import vllm; print(vllm.__version__)"

Option 2: Docker
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct

--ipc=host is required because vLLM's worker processes exchange tensors through shared memory; Docker's default 64MB /dev/shm is too small, so either use --ipc=host or pass a larger --shm-size.

Option 3: from source (custom CUDA, AMD ROCm, custom kernels)

git clone https://github.com/vllm-project/vllm.git
cd vllm
export TORCH_CUDA_ARCH_LIST="8.0 8.6 8.9 9.0 12.0"   # Ampere, Ada, Hopper, Blackwell
pip install -e .

Building from source takes 10-30 minutes depending on CPU core count and the architectures in TORCH_CUDA_ARCH_LIST. Set MAX_JOBS=8 to limit RAM usage during compilation.

Option 4: AMD ROCm (experimental)

docker pull rocm/vllm:latest
docker run --device /dev/kfd --device /dev/dri --group-add video \
    -p 8000:8000 \
    rocm/vllm:latest \
    --model meta-llama/Llama-3.1-8B-Instruct

ROCm support targets MI300X, MI250X, and Radeon RX 7900 XTX. Performance is 60-80% of CUDA equivalents. See our AMD ROCm Setup Guide for details.


Your First Model in 60 Seconds {#first-model}

# Authenticate with Hugging Face if the model is gated
huggingface-cli login

# Start server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90

In another terminal:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
        "max_tokens": 200,
        "temperature": 0.7
    }'

That's the entire baseline setup. Everything else in this guide is tuning.


OpenAI-Compatible API {#openai-api}

vLLM implements the OpenAI REST API spec. Any OpenAI client works unchanged with a different base_url.

Python (official OpenAI SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Endpoints

| Endpoint | Purpose |
|---|---|
| POST /v1/chat/completions | Chat with messages (most common) |
| POST /v1/completions | Legacy completions API |
| POST /v1/embeddings | Vector embeddings (with embedding models) |
| GET /v1/models | List loaded models |
| GET /health | Liveness probe |
| GET /metrics | Prometheus metrics |

Streaming, tool calls, structured outputs

All work. Set stream: true for SSE, pass tools for function calling, and response_format: { type: "json_schema", json_schema: {...} } for guaranteed-valid JSON output.


Quantization: AWQ, GPTQ, FP8, INT8 {#quantization}

Quantization is the biggest single performance lever after fitting in VRAM. See our CUDA optimization guide and AWQ vs GPTQ vs GGUF for the underlying theory; here are the vLLM-specific recipes.

AWQ-INT4 (best general purpose)

vllm serve casperhansen/llama-3.1-8b-instruct-awq \
    --quantization awq \
    --max-model-len 16384

Quality is typically within 1% of BF16; size is ~4x smaller. Available for almost every popular model on Hugging Face.

FP8 (Ada / Blackwell / Hopper only)

vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 32768 \
    --tensor-parallel-size 2

FP8 uses native Tensor Core support and roughly doubles throughput vs BF16 on RTX 4090 / 5090 / H100. Requires a model published in FP8 format; NeuralMagic and Meta publish FP8 versions of popular Llama models.

GPTQ-INT4

vllm serve TheBloke/Llama-2-13B-chat-GPTQ \
    --quantization gptq \
    --max-model-len 4096

Older but still common. Group size 128 is standard.

W8A8 INT8 (SmoothQuant / RTN)

vllm serve neuralmagic/Llama-3.1-8B-Instruct-quantized.w8a8 \
    --quantization compressed-tensors

Slightly higher quality than INT4 at 2x the size; useful when you have plenty of VRAM and want maximum throughput.

Quantization decision tree

  • Have Ada/Blackwell/Hopper + FP8 model exists? → FP8.
  • Need broad model coverage? → AWQ-INT4.
  • Plenty of VRAM, max quality? → BF16 (no quantization).
  • Tight VRAM budget on Ampere? → AWQ-INT4 or GPTQ-INT4.
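The same tree as code, for wiring into a deployment script (function and argument names are illustrative, not part of vLLM):

```python
def pick_quantization(arch: str, fp8_model_available: bool,
                      vram_is_tight: bool) -> str:
    """Encode the decision tree above. arch is the GPU generation,
    e.g. "ampere", "ada", "hopper", "blackwell" (labels are our own)."""
    if arch in ("ada", "hopper", "blackwell") and fp8_model_available:
        return "fp8"        # native Tensor Core support, ~2x BF16 throughput
    if vram_is_tight:
        return "awq-int4"   # ~4x smaller than BF16, typically within 1% quality
    return "bf16"           # plenty of VRAM: skip quantization entirely
```

An RTX 4090 with an FP8 checkpoint available gets `"fp8"`; a tight 3090 gets `"awq-int4"`.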

PagedAttention & KV-Cache Tuning {#paged-attention}

PagedAttention is automatic — you do not configure it directly. But you do control how much memory it gets.

Block size

--block-size 16          # default, best for most workloads
--block-size 32          # better for long contexts (>32K)

Larger blocks reduce metadata overhead but increase fragmentation. 16 is a good default; bump to 32 if you serve only long-context requests.

GPU memory utilization

--gpu-memory-utilization 0.92

vLLM allocates this fraction of total VRAM for model weights + KV cache. Default is 0.90. Push to 0.95-0.97 on dedicated inference boxes; keep 0.85-0.90 if you also run other GPU workloads.

KV-cache dtype

--kv-cache-dtype fp8_e4m3   # Ada+ only — recommended for long context
--kv-cache-dtype fp8_e5m2   # Better range, slightly worse precision
--kv-cache-dtype auto       # Match model dtype

FP8 KV cache halves the cache memory vs BF16. On Llama 3.1 70B at 32K context, this frees ~10GB of VRAM that becomes available for more concurrent requests.

Max model length

--max-model-len 32768

vLLM sizes its memory profile around the longest allowed sequence, and every request can claim up to --max-model-len worth of KV-cache blocks. Setting this lower than the model's native context length therefore directly increases concurrency. If you only need 4K context, set --max-model-len 4096 and you can fit 4-8x more concurrent requests.
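A back-of-envelope check shows why this matters. Per token, the cache stores one K and one V vector for each layer; for Llama 3.1 8B (32 layers, 8 KV heads via GQA, head dim 128, all public figures) the numbers work out as follows. This is a sketch: with --kv-cache-dtype fp8, set dtype_bytes=1 and every figure halves.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # K and V each store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, BF16 cache
per_token = kv_bytes_per_token(32, 8, 128)       # 131072 B = 128 KiB per token
per_seq_32k = per_token * 32768 / 2**30          # 4.0 GiB per max-length sequence
per_seq_4k = per_token * 4096 / 2**30            # 0.5 GiB per max-length sequence
```

At --max-model-len 4096 each worst-case sequence needs an eighth of the cache that a 32K sequence does, which is where the concurrency gain comes from.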


Continuous Batching & Scheduling {#continuous-batching}

The scheduler decides which requests run at each step. Two key parameters:

--max-num-seqs 256                # max concurrent sequences
--max-num-batched-tokens 8192     # max tokens computed per step

max-num-seqs caps concurrency; max-num-batched-tokens caps the per-step compute. The right values depend on workload:

| Workload | max-num-seqs | max-num-batched-tokens |
|---|---|---|
| Chat (short prompts, short replies) | 256 | 4096 |
| RAG (long prompts, short replies) | 64 | 16384 |
| Code completion (medium both) | 128 | 8192 |
| Long-form generation | 32 | 4096 |

Increase max-num-batched-tokens to favor throughput over latency. Decrease max-num-seqs if individual requests starve.

Priority scheduling

--scheduling-policy priority

With this set, requests that include a priority field are reordered in the queue: lower values are scheduled earlier (default 0). Useful for interactive vs batch traffic on the same endpoint.
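A prioritized request body might look like this. The priority field is a vLLM extension to the OpenAI schema (with the OpenAI SDK, pass it via extra_body); field name and semantics assume a recent vLLM version:

```python
import json

# Body for POST /v1/chat/completions against a server started with
# --scheduling-policy priority. "priority" is vLLM-specific:
# lower values are scheduled earlier, default is 0.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "interactive question"}],
    "max_tokens": 100,
    "priority": -1,  # serve ahead of default-priority batch traffic
}
body = json.dumps(payload)
```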


Tensor Parallelism & Multi-GPU {#tensor-parallel}

For models that do not fit on a single GPU, split the matmuls.

# 2x RTX 4090 — Llama 3.1 70B AWQ
vllm serve casperhansen/llama-3.1-70b-instruct-awq \
    --quantization awq \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.92

Pipeline parallel (multi-node)

vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray

This splits a 405B model across 16 GPUs (2 nodes of 8 H100s) using TP within node and PP across nodes. Requires Ray cluster: ray start --head on node 1, ray start --address=... on node 2.

NCCL tuning for multi-GPU

export NCCL_P2P_LEVEL=NVL
export NCCL_IB_DISABLE=1     # no InfiniBand on workstations; remove this on IB clusters
export NCCL_DEBUG=WARN

See CUDA Optimization for the full multi-GPU theory.


Prefix Caching & Chunked Prefill {#prefix-caching}

Prefix caching

--enable-prefix-caching

Caches KV state for repeated prompt prefixes (system prompts, few-shot examples, RAG retrievals). For agent workloads with long system prompts, this delivers 10-100x lower time-to-first-token on cached prefixes. Free win — always enable in production.

Chunked prefill

--enable-chunked-prefill
--max-num-batched-tokens 8192

Splits long prompt prefill into chunks that interleave with decode steps from other requests. A 32K-token prompt no longer blocks the queue for 5 seconds; it shares the GPU with shorter requests. Critical for mixed RAG + chat workloads.
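The arithmetic behind that claim is simple: with --max-num-batched-tokens 8192, a 32K prompt is prefilled in four chunks, so a short request waits at most one chunk instead of the whole prefill. The 10K tok/s prefill rate below is illustrative, not a measured number:

```python
def prefill_chunks(prompt_tokens: int, max_batched_tokens: int) -> int:
    # Ceiling division: how many scheduler steps the prefill is spread across
    return -(-prompt_tokens // max_batched_tokens)

chunks = prefill_chunks(32768, 8192)    # 4 scheduler steps
full_block_s = 32768 / 10_000           # ~3.3 s if prefill runs at ~10K tok/s
worst_wait_s = full_block_s / chunks    # ~0.8 s max head-of-line wait per chunk
```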

Disk-based KV cache (LMCache)

For very large prefix-cache working sets:

pip install lmcache
vllm serve <model> \
    --enable-prefix-caching \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1"}'

LMCache spills cold prefix-cache entries to CPU RAM and disk. Useful for retrieval-heavy workloads where the same passages get reused across many users.


Speculative Decoding {#speculative}

Draft tokens with a small model, verify with the big one. See CUDA Optimization for the theory.

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5

Pair models with identical tokenizers. Llama 3.1 70B + Llama 3.2 1B is the canonical pair (same vocab). Expected speedup: 1.5-2.5x at single-user batch size 1; less under high concurrency.
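The ceiling can be computed from the acceptance rate. Under the standard analysis (Leviathan et al., 2023), with per-token acceptance rate alpha and k draft tokens, each target-model pass yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average. The sketch below ignores the draft model's own cost, which is why real speedups land below it:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens accepted per target-model forward pass with
    acceptance rate alpha and k draft tokens. Upper bound: ignores draft cost."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

ceiling = expected_tokens_per_step(0.8, 5)   # ~3.69 tokens per target pass
```

With --num-speculative-tokens 5 and an 80% acceptance rate, the theoretical ceiling is ~3.7x, consistent with the observed 1.5-2.5x once draft and verification overhead are paid.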

EAGLE / EAGLE-2

vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --speculative-model yuhuili/EAGLE-LLaMA3.1-Instruct-8B \
    --speculative-draft-tensor-parallel-size 1

EAGLE is faster than vanilla speculative decoding (2.5-3.5x) because it shares the target model's hidden states.

N-gram (no draft model)

--speculative-model "[ngram]" --ngram-prompt-lookup-max 4

Looks for repeated n-grams from the prompt itself. Excellent for code generation and RAG where output reuses prompt content. Free speedup with zero extra memory.
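A toy version of the idea, on word tokens rather than real tokenizer IDs (the function is illustrative, not vLLM internals):

```python
def ngram_draft(prompt: list, generated: list, n: int = 2, k: int = 3) -> list:
    """Toy prompt-lookup drafting: if the last n tokens of the output so far
    also appear in the prompt, propose the k tokens that followed them there."""
    tail = (prompt + generated)[-n:]
    # Search the prompt right-to-left for the most recent match
    for i in range(len(prompt) - n, -1, -1):
        if prompt[i:i + n] == tail:
            return prompt[i + n:i + n + k]
    return []  # no match: fall back to normal decoding

prompt = "def add ( a , b ) : return a + b".split()
draft = ngram_draft(prompt, ["def", "add"])   # proposes "( a ,"
```

This is why it shines on code and RAG: the output keeps quoting the prompt, so drafts are accepted at high rates with zero extra memory.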


Tool Calling & Structured Outputs {#tool-calling}

Tool / function calling

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Weather in Tokyo?"}],
    tools=tools,
)

vLLM supports tool calling for Llama 3.1+, Hermes, Mistral, and Qwen 2.5+. Pass --enable-auto-tool-choice --tool-call-parser <model_family> to enable parsing.
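Continuing the example, the tool_calls the server returns can be executed with a small dispatch loop. get_weather here is a stand-in for a real lookup, and the dict mirrors the JSON wire shape (the SDK exposes the same fields as object attributes):

```python
import json

def get_weather(city: str) -> str:
    # Stand-in for a real weather lookup (illustrative)
    return f"22C and clear in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> dict:
    """Run one tool call and build the role="tool" follow-up message
    to append to the conversation before the next completion request."""
    fn = tool_call["function"]
    result = TOOLS[fn["name"]](**json.loads(fn["arguments"]))
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": result}

# Same shape the OpenAI-compatible API returns in message.tool_calls
call = {"id": "call_1", "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Tokyo"}'}}
msg = dispatch(call)
```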

Structured outputs (JSON Schema)

resp = client.chat.completions.create(
    model="...",
    messages=[...],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "person", "schema": {...}},
    },
)

vLLM uses xgrammar (default) or outlines for constrained generation. Output is guaranteed schema-valid. Throughput overhead is 5-15%.

--guided-decoding-backend xgrammar     # default, fastest
--guided-decoding-backend outlines     # alternative, sometimes more compatible

Docker Deployment {#docker}

Single-container (simple)

docker run -d --name vllm \
    --restart unless-stopped \
    --runtime nvidia --gpus all \
    --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e HF_TOKEN=$HF_TOKEN \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 16384 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.92

docker-compose.yml

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    ipc: host
    restart: unless-stopped
    environment:
      HF_TOKEN: ${HF_TOKEN}
    volumes:
      - hf-cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --max-model-len 16384
      --enable-prefix-caching
      --gpu-memory-utilization 0.92
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  hf-cache:

Pre-pulling models

Models cached in ~/.cache/huggingface survive container restarts. Pre-download once on the host:

huggingface-cli download meta-llama/Llama-3.1-8B-Instruct

Kubernetes & KServe {#kubernetes}

Bare Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-8b
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-llama-8b }
  template:
    metadata:
      labels: { app: vllm-llama-8b }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - meta-llama/Llama-3.1-8B-Instruct
            - --max-model-len
            - "16384"
            - --enable-prefix-caching
            - --gpu-memory-utilization
            - "0.92"
          ports:
            - { name: http, containerPort: 8000 }
          env:
            - name: HF_TOKEN
              valueFrom: { secretKeyRef: { name: hf-token, key: token } }
          resources:
            limits: { nvidia.com/gpu: "1" }
          readinessProbe:
            httpGet: { path: /health, port: http }
            initialDelaySeconds: 60
            periodSeconds: 10
          volumeMounts:
            - { name: hf-cache, mountPath: /root/.cache/huggingface }
            - { name: shm, mountPath: /dev/shm }
      volumes:
        - name: hf-cache
          persistentVolumeClaim: { claimName: hf-cache-pvc }
        - name: shm
          emptyDir: { medium: Memory, sizeLimit: 16Gi }

The /dev/shm volume is critical — without it, multi-worker setups fail with cryptic CUDA IPC errors.

KServe InferenceService

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b
spec:
  predictor:
    model:
      modelFormat: { name: vllm }
      args:
        - --max-model-len=16384
        - --enable-prefix-caching
      runtime: kserve-vllmserver
      storageUri: hf://meta-llama/Llama-3.1-8B-Instruct
      resources:
        limits: { nvidia.com/gpu: "1" }

KServe handles autoscaling (KEDA / native), traffic splitting for canary deploys, and GitOps integration.

Horizontal Pod Autoscaler

vLLM exposes vllm:num_requests_running and vllm:num_requests_waiting via Prometheus. Scale on queue depth:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: vllm-hpa }
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama-8b
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric: { name: vllm_num_requests_waiting }
        target: { type: AverageValue, averageValue: "5" }

Observability: Prometheus, Tracing, Logging {#observability}

Prometheus metrics

vLLM exposes /metrics natively. Scrape every 15s.

Key metrics:

| Metric | Meaning |
|---|---|
| vllm:num_requests_running | Currently decoding |
| vllm:num_requests_waiting | Queued |
| vllm:gpu_cache_usage_perc | KV-cache utilization |
| vllm:time_to_first_token_seconds | TTFT histogram |
| vllm:time_per_output_token_seconds | Per-token latency |
| vllm:e2e_request_latency_seconds | Total request latency |
| vllm:prompt_tokens_total | Aggregate prompt tokens |
| vllm:generation_tokens_total | Aggregate generated tokens |

Grafana dashboard

Import dashboard ID 19655 (community vLLM dashboard). Key panels: TTFT p50/p95/p99, throughput tok/s, KV-cache utilization, queue depth.

OpenTelemetry tracing

vllm serve <model> --otlp-traces-endpoint http://otel-collector:4317

Traces include per-request prefill/decode timings, scheduling delays, and inter-token gaps.

Structured logging

export VLLM_LOGGING_CONFIG_PATH=./logging.json

Configure JSON logs for log aggregators (Loki, Elasticsearch, Datadog).


Authentication, Rate Limiting, Multi-Tenancy {#auth}

vLLM does not have built-in auth — put it behind a gateway.

Pattern: LiteLLM proxy

# litellm_config.yaml
model_list:
  - model_name: llama-3-1-8b
    litellm_params:
      model: openai/meta-llama/Llama-3.1-8B-Instruct
      api_base: http://vllm:8000/v1
      api_key: dummy

litellm_settings:
  master_key: sk-litellm-master
  database_url: postgresql://...

general_settings:
  enforce_user_param: true
  budget_duration: 30d

LiteLLM handles per-key rate limits, budget tracking, fallbacks, and request routing across multiple vLLM replicas.

Pattern: Envoy / Istio

For Kubernetes-native auth, use Istio AuthorizationPolicy or Envoy filters with API key headers. Combine with Ollama rate limiting patterns adapted to vLLM.


Benchmarking Your Deployment {#benchmarking}

Built-in benchmark scripts

git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks

python benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000 \
    --request-rate 10

Outputs:

  • Throughput (req/s and tok/s)
  • TTFT p50/p95/p99
  • ITL (inter-token latency) p50/p95/p99
  • Acceptance rate (with speculative decoding)

Real benchmarks

RTX 4090 (24GB), Llama 3.1 8B AWQ, FP8 KV cache:

| Concurrency | TTFT p50 | TTFT p99 | Tok/s aggregate | GPU util |
|---|---|---|---|---|
| 1 | 35 ms | 55 ms | 142 | 78% |
| 4 | 48 ms | 110 ms | 480 | 95% |
| 16 | 95 ms | 280 ms | 1,150 | 99% |
| 32 | 170 ms | 510 ms | 1,720 | 99% |
| 64 | 320 ms | 1,100 ms | 2,200 | 99% |

Tradeoff: 64 concurrent requests gets 15x more aggregate throughput than 1 — but tail latency degrades. Set HPA targets to keep p95 TTFT under 500ms.
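That target can be read straight off the table. A sketch that picks the largest concurrency whose measured tail latency stays under a budget (it uses the p99 column because that is what the table reports; rows reuse the RTX 4090 numbers above):

```python
# (concurrency, TTFT p99 ms, aggregate tok/s) from the RTX 4090 table above
ROWS = [(1, 55, 142), (4, 110, 480), (16, 280, 1150),
        (32, 510, 1720), (64, 1100, 2200)]

def max_concurrency(rows, ttft_p99_limit_ms: float) -> int:
    """Largest measured concurrency whose p99 TTFT stays under the limit."""
    ok = [c for c, ttft, _ in rows if ttft <= ttft_p99_limit_ms]
    return max(ok) if ok else 0

target = max_concurrency(ROWS, 600)   # 32 concurrent requests on this data
```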


Tuning Recipes by GPU {#tuning-recipes}

RTX 3090 / 3090 Ti (24GB Ampere)

vllm serve casperhansen/llama-3.1-8b-instruct-awq \
    --quantization awq \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256

RTX 4090 (24GB Ada)

vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.93 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 16384

RTX 5090 (32GB Blackwell)

vllm serve casperhansen/llama-3.1-70b-instruct-awq \
    --quantization awq \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.94 \
    --enable-prefix-caching \
    --enable-chunked-prefill

2x RTX 4090 (48GB total, PCIe)

vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching

8x H100 (640GB total)

vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
    --quantization fp8 \
    --tensor-parallel-size 8 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.94 \
    --enable-prefix-caching \
    --enable-chunked-prefill

Common Errors & Fixes {#troubleshooting}

| Error | Cause | Fix |
|---|---|---|
| CUDA out of memory at startup | Model + KV cache exceeds VRAM | Lower --max-model-len, use AWQ/FP8, or reduce --gpu-memory-utilization |
| CUDA out of memory mid-request | Too many concurrent sequences | Lower --max-num-seqs or --max-num-batched-tokens |
| AssertionError: Cannot find model in HF cache | Not authenticated | huggingface-cli login or set HF_TOKEN |
| RuntimeError: NCCL error | TP across GPUs without proper IPC | Add --ipc=host to docker run; mount /dev/shm in K8s |
| Speculative decoding accept rate < 50% | Mismatched draft/target tokenizer | Use same model family for draft and target |
| TTFT > 5s on first request | Cold start, kernel autotune | Warm up with 5-10 dummy requests after start |
| Throughput drops after hours of uptime | KV-cache fragmentation (rare) | Restart; tune --block-size 32 for long-context workloads |
| Pip install fails with torch conflict | System torch version mismatch | Fresh venv; never install over system torch |

FAQ {#faq}

See answers to common vLLM questions below.


Sources: vLLM documentation | PagedAttention paper (Kwon et al., 2023) | vLLM GitHub | NeuralMagic FP8 model collection | Internal benchmarks on RTX 3090, 4090, 5090, A100, H100.

Written by Pattanaik Ramswarup, creator of Local AI Master. He builds the site around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps.