vLLM Complete Setup Guide for Local LLMs (2026): Install, Tune, Serve
vLLM delivers 5-20x higher aggregate throughput than Ollama on multi-user workloads — and with the right configuration, lower per-token latency too. This guide covers everything: installation, model loading, quantization, tensor parallelism, OpenAI-compatible serving, prefix caching, chunked prefill, Docker, Kubernetes, observability, and tuning playbooks for the most common GPUs.
If you serve more than one concurrent user, vLLM is the upgrade that pays for itself the same day you deploy it.
Table of Contents
- What vLLM Is, In One Page
- vLLM vs Ollama vs llama.cpp vs TensorRT-LLM
- System Requirements
- Installation
- Your First Model in 60 Seconds
- OpenAI-Compatible API
- Quantization: AWQ, GPTQ, FP8, INT8
- PagedAttention & KV-Cache Tuning
- Continuous Batching & Scheduling
- Tensor Parallelism & Multi-GPU
- Prefix Caching & Chunked Prefill
- Speculative Decoding
- Tool Calling & Structured Outputs
- Docker Deployment
- Kubernetes & KServe
- Observability: Prometheus, Tracing, Logging
- Authentication, Rate Limiting, Multi-Tenancy
- Benchmarking Your Deployment
- Tuning Recipes by GPU
- Common Errors & Fixes
- FAQ
What vLLM Is, In One Page {#what-vllm-is}
vLLM is an open-source inference and serving engine for LLMs, originally built at UC Berkeley (PagedAttention paper, 2023). It is written in Python with custom CUDA kernels and is now the de facto standard for self-hosted high-throughput LLM serving.
Core innovations:
- PagedAttention — KV cache stored in fixed-size blocks, near-zero fragmentation.
- Continuous batching — schedules at the iteration level, no idle GPU.
- Optimized kernels — FlashAttention 2/3, FP8, AWQ-INT4 dequant fused into matmul.
- OpenAI-compatible API — drop-in for any client expecting OpenAI endpoints.
- Distributed inference — tensor parallel + pipeline parallel + Ray-based multi-node.
What it is not: a desktop app, a fine-tuning tool, or a model converter. It is a server. Pair it with LiteLLM for routing, Open WebUI for chat UI, and Langfuse for tracing.
vLLM vs Ollama vs llama.cpp vs TensorRT-LLM {#comparison}
| Feature | Ollama | llama.cpp | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|---|---|
| Single-user latency | Excellent | Excellent | Good | Best | Excellent |
| Aggregate throughput (32 users) | Poor | Poor | Excellent | Excellent | Excellent |
| Multi-user concurrency | 1-4 | 1-4 | 256+ | 128+ | 256+ |
| OpenAI API | ✅ | ✅ | ✅ | via Triton | ✅ |
| Quantization formats | GGUF | GGUF | AWQ, GPTQ, FP8, INT8, GGUF | INT4-AWQ, FP8, INT8 | AWQ, FP8 |
| Prefix caching | ❌ | partial | ✅ | ✅ | ✅ (best) |
| Tensor parallel | ❌ | partial | ✅ | ✅ | ✅ |
| Pipeline parallel | basic | basic | ✅ | ✅ | ✅ |
| FP8 (Ada/Hopper) | ❌ | ❌ | ✅ | ✅ | ✅ |
| Setup time | 60s | 5min | 5min | 30-60min | 5min |
| Best for | Desktop | Desktop / edge | Production servers | Lowest latency | Agent frameworks |
Decision rule: if more than 4 users hit your endpoint at once, switch to vLLM. If you need the lowest single-stream latency for a critical path, use TensorRT-LLM. Otherwise stay on Ollama for simplicity.
System Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| GPU | Compute capability 7.0+ (V100/T4/RTX 20-series) | Ampere or newer (RTX 30/40/50, A100, H100) |
| VRAM (per model size) | 8B BF16: 18GB / 8B AWQ: 8GB / 70B AWQ: 40GB | +30% headroom for KV cache |
| Driver | 535+ | 550+ (FP8 needs Ada/Hopper) |
| CUDA Toolkit | 12.1 | 12.4+ (build only; runtime uses driver) |
| Python | 3.9 | 3.10-3.12 |
| RAM | 32GB | 64GB+ for 70B models |
| Disk | 100GB free | NVMe; models can be 200GB+ in BF16 |
| OS | Linux (Ubuntu 22.04+) | Ubuntu 22.04 LTS |
Windows native is not officially supported; use WSL2. macOS is not supported (use llama.cpp / Ollama or MLX).
Installation {#installation}
Option 1: pip (most common)
# Fresh venv, always
python3.11 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
pip install --upgrade pip
# vLLM with CUDA 12.4 wheels
pip install vllm
# Verify
python -c "import vllm; print(vllm.__version__)"
Option 2: Docker (recommended for production)
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct
--ipc=host is required because vLLM's worker processes exchange tensors through host shared memory; without it, multi-GPU runs fail with cryptic IPC errors.
Option 3: from source (custom CUDA, AMD ROCm, custom kernels)
git clone https://github.com/vllm-project/vllm.git
cd vllm
export TORCH_CUDA_ARCH_LIST="8.0 8.6 8.9 9.0 12.0" # Ampere, Ada, Hopper, Blackwell
pip install -e .
Building from source takes 10-30 minutes depending on CPU core count and how many target architectures you compile for. Set MAX_JOBS=8 to cap parallel compile jobs and keep RAM usage in check.
Option 4: AMD ROCm (experimental)
docker pull rocm/vllm:latest
docker run --device /dev/kfd --device /dev/dri --group-add video \
-p 8000:8000 \
rocm/vllm:latest \
--model meta-llama/Llama-3.1-8B-Instruct
ROCm support targets MI300X, MI250X, and Radeon RX 7900 XTX. Performance is 60-80% of CUDA equivalents. See our AMD ROCm Setup Guide for details.
Your First Model in 60 Seconds {#first-model}
# Authenticate with Hugging Face if the model is gated
huggingface-cli login
# Start server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90
In another terminal:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
"max_tokens": 200,
"temperature": 0.7
}'
That's the entire baseline setup. Everything else in this guide is tuning.
OpenAI-Compatible API {#openai-api}
vLLM implements the OpenAI REST API spec. Any OpenAI client works unchanged with a different base_url.
Python (official OpenAI SDK)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
resp = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Hello"}],
stream=True,
)
for chunk in resp:
print(chunk.choices[0].delta.content or "", end="", flush=True)
Endpoints
| Endpoint | Purpose |
|---|---|
| POST /v1/chat/completions | Chat with messages (most common) |
| POST /v1/completions | Legacy completions API |
| POST /v1/embeddings | Vector embeddings (with embedding models) |
| GET /v1/models | List loaded models |
| GET /health | Liveness probe |
| GET /metrics | Prometheus metrics |
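A quick smoke test against the liveness and model-list endpoints, using the requests library (a sketch; adjust the base URL to wherever your server listens):
import requests

BASE = "http://localhost:8000"

# Liveness probe: returns 200 with an empty body once the engine is up
print(requests.get(f"{BASE}/health").status_code)

# List the models the server has loaded
for m in requests.get(f"{BASE}/v1/models").json()["data"]:
    print(m["id"])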
Streaming, tool calls, structured outputs
All work. Set stream: true for SSE, pass tools for function calling, and response_format: { type: "json_schema", json_schema: {...} } for guaranteed-valid JSON output.
Quantization: AWQ, GPTQ, FP8, INT8 {#quantization}
Quantization is the biggest single performance lever after fitting in VRAM. See our CUDA optimization guide and AWQ vs GPTQ vs GGUF for the underlying theory; here are the vLLM-specific recipes.
AWQ-INT4 (best general purpose)
vllm serve casperhansen/llama-3.1-8b-instruct-awq \
--quantization awq \
--max-model-len 16384
Quality is typically within 1% of BF16; size is ~4x smaller. Available for almost every popular model on Hugging Face.
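The same engine is also usable as a Python library for offline batch jobs instead of serving. A minimal sketch with the AWQ checkpoint above (assumes the model fits on your GPU and your Hugging Face login can access it):
from vllm import LLM, SamplingParams

# Offline batch inference with the same AWQ checkpoint
llm = LLM(model="casperhansen/llama-3.1-8b-instruct-awq",
          quantization="awq", max_model_len=16384)
params = SamplingParams(temperature=0.7, max_tokens=200)

outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)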
FP8 (Ada / Blackwell / Hopper only)
vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--max-model-len 32768 \
--tensor-parallel-size 2
FP8 uses native Tensor Core support and roughly doubles throughput vs BF16 on RTX 4090 / 5090 / H100. Requires a model published in FP8 format; NeuralMagic and Meta publish FP8 versions of popular Llama models.
GPTQ-INT4
vllm serve TheBloke/Llama-2-13B-chat-GPTQ \
--quantization gptq \
--max-model-len 4096
Older but still common. Group size 128 is standard.
W8A8 INT8 (SmoothQuant / RTN)
vllm serve neuralmagic/Llama-3.1-8B-Instruct-quantized.w8a8 \
--quantization compressed-tensors
Slightly higher quality than INT4 at 2x the size; useful when you have plenty of VRAM and want maximum throughput.
Quantization decision tree
- Ada/Hopper/Blackwell GPU and an FP8 checkpoint available? → FP8.
- Need broad model coverage? → AWQ-INT4.
- Plenty of VRAM, max quality? → BF16 (no quantization).
- Tight VRAM budget on Ampere? → AWQ-INT4 or GPTQ-INT4.
PagedAttention & KV-Cache Tuning {#paged-attention}
PagedAttention is automatic — you do not configure it directly. But you do control how much memory it gets.
Block size
--block-size 16 # default, best for most workloads
--block-size 32 # better for long contexts (>32K)
Larger blocks reduce metadata overhead but increase fragmentation. 16 is a good default; bump to 32 if you serve only long-context requests.
GPU memory utilization
--gpu-memory-utilization 0.92
vLLM allocates this fraction of total VRAM for model weights + KV cache. Default is 0.90. Push to 0.95-0.97 on dedicated inference boxes; keep 0.85-0.90 if you also run other GPU workloads.
KV-cache dtype
--kv-cache-dtype fp8_e4m3 # Ada+ only — recommended for long context
--kv-cache-dtype fp8_e5m2 # Better range, slightly worse precision
--kv-cache-dtype auto # Match model dtype
FP8 KV cache halves cache memory vs BF16. On Llama 3.1 70B at 32K context, this frees roughly 10GB of VRAM for additional concurrent requests.
Max model length
--max-model-len 32768
vLLM pre-allocates KV-cache space for the longest possible sequence. Setting this lower than the model's native context length directly increases concurrency. If you only need 4K context, set --max-model-len 4096 and you can fit 4-8x more concurrent requests.
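A back-of-the-envelope calculation shows why. This sketch uses Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dim 128) and assumes an FP16 KV cache:
# Worst-case KV-cache footprint the scheduler must budget per sequence
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
print(per_token / 1024, "KiB per token")                        # ~128 KiB

for max_len in (4096, 32768):
    gib = per_token * max_len / 1024**3
    print(f"max-model-len {max_len}: ~{gib:.1f} GiB per sequence")
# 4096 tokens reserves ~0.5 GiB per sequence vs ~4 GiB at 32768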
Continuous Batching & Scheduling {#continuous-batching}
The scheduler decides which requests run at each step. Two key parameters:
--max-num-seqs 256 # max concurrent sequences
--max-num-batched-tokens 8192 # max tokens computed per step
max-num-seqs caps concurrency; max-num-batched-tokens caps the per-step compute. The right values depend on workload:
| Workload | max-num-seqs | max-num-batched-tokens |
|---|---|---|
| Chat (short prompts, short replies) | 256 | 4096 |
| RAG (long prompts, short replies) | 64 | 16384 |
| Code completion (medium both) | 128 | 8192 |
| Long-form generation | 32 | 4096 |
Increase max-num-batched-tokens to favor throughput over latency. Decrease max-num-seqs if individual requests starve.
Priority scheduling
--scheduling-policy priority
With this set, requests with higher priority field jump the queue. Useful for interactive vs batch traffic on the same endpoint.
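From the client side, the priority rides along with a normal request. A sketch using the OpenAI SDK, assuming your vLLM version accepts a top-level priority field (the exact field name and whether lower or higher values win should be checked against your version's docs):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Interactive query"}],
    # extra_body forwards fields the OpenAI SDK does not know about
    extra_body={"priority": 0},  # assumed field; verify against your vLLM version
)
print(resp.choices[0].message.content)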
Tensor Parallelism & Multi-GPU {#tensor-parallel}
For models that do not fit on a single GPU, split the matmuls.
# 2x RTX 4090 — Llama 3.1 70B AWQ
vllm serve casperhansen/llama-3.1-70b-instruct-awq \
--quantization awq \
--tensor-parallel-size 2 \
--max-model-len 16384 \
--gpu-memory-utilization 0.92
Pipeline parallel (multi-node)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--distributed-executor-backend ray
This splits a 405B model across 16 GPUs (2 nodes of 8 H100s) using TP within node and PP across nodes. Requires Ray cluster: ray start --head on node 1, ray start --address=... on node 2.
NCCL tuning for multi-GPU
export NCCL_P2P_LEVEL=NVL
export NCCL_IB_DISABLE=1 # workstations only; set to 0 (or remove) on InfiniBand clusters
export NCCL_DEBUG=WARN
See CUDA Optimization for the full multi-GPU theory.
Prefix Caching & Chunked Prefill {#prefix-caching}
Prefix caching
--enable-prefix-caching
Caches KV state for repeated prompt prefixes (system prompts, few-shot examples, RAG retrievals). For agent workloads with long system prompts, this delivers 10-100x lower time-to-first-token on cached prefixes. Free win — always enable in production.
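A simple way to see the effect is to send the same long system prompt twice and time the first streamed token. A sketch (the prompt is synthetic and absolute numbers depend on your GPU):
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
system = "You are a support agent. " + "Policy paragraph. " * 400  # long shared prefix

def ttft(question):
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": question}],
        max_tokens=32, stream=True)
    for chunk in stream:
        if chunk.choices[0].delta.content:
            return time.perf_counter() - start

print("cold prefix:", ttft("What is the refund window?"))
print("warm prefix:", ttft("How do I reset my password?"))  # prefix KV reused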
Chunked prefill
--enable-chunked-prefill
--max-num-batched-tokens 8192
Splits long prompt prefill into chunks that interleave with decode steps from other requests. A 32K-token prompt no longer blocks the queue for 5 seconds; it shares the GPU with shorter requests. Critical for mixed RAG + chat workloads.
Disk-based KV cache (LMCache)
For very large prefix-cache working sets:
pip install lmcache
vllm serve <model> \
--enable-prefix-caching \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1"}'
LMCache spills cold prefix-cache entries to CPU RAM and disk. Useful for retrieval-heavy workloads where the same passages get reused across many users.
Speculative Decoding {#speculative}
Draft tokens with a small model, verify with the big one. See CUDA Optimization for the theory.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5
Pair models with identical tokenizers. Llama 3.1 70B + Llama 3.2 1B is the canonical pair (same vocab). Expected speedup: 1.5-2.5x at single-user batch size 1; less under high concurrency.
EAGLE / EAGLE-2
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative-model yuhuili/EAGLE-LLaMA3.1-Instruct-8B \
--speculative-draft-tensor-parallel-size 1
EAGLE is faster than vanilla speculative decoding (2.5-3.5x) because it shares the target model's hidden states.
N-gram (no draft model)
--speculative-model "[ngram]" --ngram-prompt-lookup-max 4
Looks for repeated n-grams from the prompt itself. Excellent for code generation and RAG where output reuses prompt content. Free speedup with zero extra memory.
Tool Calling & Structured Outputs {#tool-calling}
Tool / function calling
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
tools = [{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
}]
resp = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Weather in Tokyo?"}],
tools=tools,
)
vLLM supports tool calling for Llama 3.1+, Hermes, Mistral, and Qwen 2.5+. Pass --enable-auto-tool-choice --tool-call-parser <model_family> to enable parsing.
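The tool call comes back on the message object; your code runs the function and returns the result in a follow-up tool message. A minimal continuation of the example above (the weather lookup result is hypothetical):
import json

call = resp.choices[0].message.tool_calls[0]        # assumes the model chose to call a tool
args = json.loads(call.function.arguments)          # e.g. {"city": "Tokyo"}
result = {"city": args["city"], "temp_c": 21}       # stand-in for a real lookup

followup = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Weather in Tokyo?"},
        resp.choices[0].message,                     # assistant turn containing the tool call
        {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
    ],
    tools=tools,
)
print(followup.choices[0].message.content)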
Structured outputs (JSON Schema)
resp = client.chat.completions.create(
model="...",
messages=[...],
response_format={
"type": "json_schema",
"json_schema": {"name": "person", "schema": {...}},
},
)
vLLM uses xgrammar (default) or outlines for constrained generation. Output is guaranteed schema-valid. Throughput overhead is 5-15%.
--guided-decoding-backend xgrammar # default, fastest
--guided-decoding-backend outlines # alternative, sometimes more compatible
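A complete round trip with a concrete schema (the person schema here is illustrative):
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Extract: Ada Lovelace died at age 36."}],
    response_format={"type": "json_schema",
                     "json_schema": {"name": "person", "schema": schema}},
)
person = json.loads(resp.choices[0].message.content)  # guaranteed to match the schema
print(person["name"], person["age"])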
Docker Deployment {#docker}
Single-container (simple)
docker run -d --name vllm \
--restart unless-stopped \
--runtime nvidia --gpus all \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN=$HF_TOKEN \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 16384 \
--enable-prefix-caching \
--gpu-memory-utilization 0.92
docker-compose.yml
services:
vllm:
image: vllm/vllm-openai:latest
runtime: nvidia
ipc: host
restart: unless-stopped
environment:
HF_TOKEN: ${HF_TOKEN}
volumes:
- hf-cache:/root/.cache/huggingface
ports:
- "8000:8000"
command: >
--model meta-llama/Llama-3.1-8B-Instruct
--max-model-len 16384
--enable-prefix-caching
--gpu-memory-utilization 0.92
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
volumes:
hf-cache:
Pre-pulling models
Models cached in ~/.cache/huggingface survive container restarts. Pre-download once on the host:
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct
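The same pre-download can be scripted with huggingface_hub, which the CLI wraps (a sketch; the ignore pattern skips the non-safetensors originals shipped with Llama repos):
from huggingface_hub import snapshot_download

# Downloads into ~/.cache/huggingface by default, so the vLLM container finds it
snapshot_download("meta-llama/Llama-3.1-8B-Instruct",
                  ignore_patterns=["*.pth", "original/*"])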
Kubernetes & KServe {#kubernetes}
Bare Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-llama-8b
spec:
replicas: 1
selector:
matchLabels: { app: vllm-llama-8b }
template:
metadata:
labels: { app: vllm-llama-8b }
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- --model
- meta-llama/Llama-3.1-8B-Instruct
- --max-model-len
- "16384"
- --enable-prefix-caching
- --gpu-memory-utilization
- "0.92"
ports:
- { name: http, containerPort: 8000 }
env:
- name: HF_TOKEN
valueFrom: { secretKeyRef: { name: hf-token, key: token } }
resources:
limits: { nvidia.com/gpu: "1" }
readinessProbe:
httpGet: { path: /health, port: http }
initialDelaySeconds: 60
periodSeconds: 10
volumeMounts:
- { name: hf-cache, mountPath: /root/.cache/huggingface }
- { name: shm, mountPath: /dev/shm }
volumes:
- name: hf-cache
persistentVolumeClaim: { claimName: hf-cache-pvc }
- name: shm
emptyDir: { medium: Memory, sizeLimit: 16Gi }
The /dev/shm volume is critical — without it, multi-worker setups fail with cryptic CUDA IPC errors.
KServe InferenceService
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-3-1-8b
spec:
predictor:
model:
modelFormat: { name: vllm }
args:
- --max-model-len=16384
- --enable-prefix-caching
runtime: kserve-vllmserver
storageUri: hf://meta-llama/Llama-3.1-8B-Instruct
resources:
limits: { nvidia.com/gpu: "1" }
KServe handles autoscaling (KEDA / native), traffic splitting for canary deploys, and GitOps integration.
Horizontal Pod Autoscaler
vLLM exposes vllm:num_requests_running and vllm:num_requests_waiting via Prometheus. Scale on queue depth:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: vllm-hpa }
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-llama-8b
minReplicas: 1
maxReplicas: 10
metrics:
- type: Pods
pods:
metric: { name: vllm_num_requests_waiting }
target: { type: AverageValue, averageValue: "5" }
Observability: Prometheus, Tracing, Logging {#observability}
Prometheus metrics
vLLM exposes /metrics natively. Scrape every 15s.
Key metrics:
| Metric | Meaning |
|---|---|
| vllm:num_requests_running | Currently decoding |
| vllm:num_requests_waiting | Queued |
| vllm:gpu_cache_usage_perc | KV-cache utilization |
| vllm:time_to_first_token_seconds | TTFT histogram |
| vllm:time_per_output_token_seconds | Per-token latency |
| vllm:e2e_request_latency_seconds | Total request latency |
| vllm:prompt_tokens_total | Aggregate prompt tokens |
| vllm:generation_tokens_total | Aggregate generated tokens |
Grafana dashboard
Import dashboard ID 19655 (community vLLM dashboard). Key panels: TTFT p50/p95/p99, throughput tok/s, KV-cache utilization, queue depth.
OpenTelemetry tracing
vllm serve <model> --otlp-traces-endpoint http://otel-collector:4317
Traces include per-request prefill/decode timings, scheduling delays, and inter-token gaps.
Structured logging
export VLLM_LOGGING_CONFIG_PATH=./logging.json
Configure JSON logs for log aggregators (Loki, Elasticsearch, Datadog).
Authentication, Rate Limiting, Multi-Tenancy {#auth}
vLLM does not have built-in auth — put it behind a gateway.
Pattern: LiteLLM proxy
# litellm_config.yaml
model_list:
- model_name: llama-3-1-8b
litellm_params:
model: openai/meta-llama/Llama-3.1-8B-Instruct
api_base: http://vllm:8000/v1
api_key: dummy
litellm_settings:
master_key: sk-litellm-master
database_url: postgresql://...
general_settings:
enforce_user_param: true
budget_duration: 30d
LiteLLM handles per-key rate limits, budget tracking, fallbacks, and request routing across multiple vLLM replicas.
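Clients then talk to the proxy instead of vLLM directly, using a virtual key that LiteLLM issues. A sketch (port 4000 is LiteLLM's default; the key shown is a placeholder):
from openai import OpenAI

# Point clients at the LiteLLM proxy, not at vLLM itself
client = OpenAI(base_url="http://litellm:4000/v1", api_key="sk-team-a-virtual-key")
resp = client.chat.completions.create(
    model="llama-3-1-8b",   # the model_name from litellm_config.yaml
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)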
Pattern: Envoy / Istio
For Kubernetes-native auth, use Istio AuthorizationPolicy or Envoy filters with API key headers. Combine with Ollama rate limiting patterns adapted to vLLM.
Benchmarking Your Deployment {#benchmarking}
Built-in benchmark scripts
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
python benchmark_serving.py \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 1000 \
--request-rate 10
Outputs:
- Throughput (req/s and tok/s)
- TTFT p50/p95/p99
- ITL (inter-token latency) p50/p95/p99
- Acceptance rate (with speculative decoding)
Real benchmarks
RTX 4090 (24GB), Llama 3.1 8B AWQ, FP8 KV cache:
| Concurrency | TTFT p50 | TTFT p99 | Tok/s aggregate | GPU util |
|---|---|---|---|---|
| 1 | 35 ms | 55 ms | 142 | 78% |
| 4 | 48 ms | 110 ms | 480 | 95% |
| 16 | 95 ms | 280 ms | 1,150 | 99% |
| 32 | 170 ms | 510 ms | 1,720 | 99% |
| 64 | 320 ms | 1,100 ms | 2,200 | 99% |
Tradeoff: 64 concurrent requests deliver roughly 15x the aggregate throughput of a single request, but tail latency degrades. Set HPA targets to keep p95 TTFT under 500 ms.
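If you want a quick concurrency sweep without the full benchmark harness, a small asyncio script gives rough TTFT numbers. A sketch (the model name must match whatever your server loaded; treat results as directional, not comparable to benchmark_serving.py):
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="x")

async def one_request():
    start = time.perf_counter()
    stream = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        max_tokens=64, stream=True)
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            return time.perf_counter() - start

async def main(concurrency=16):
    ttfts = await asyncio.gather(*[one_request() for _ in range(concurrency)])
    ttfts = sorted(t for t in ttfts if t is not None)
    print(f"n={len(ttfts)}  p50={ttfts[len(ttfts)//2]*1000:.0f} ms  "
          f"p95={ttfts[int(len(ttfts)*0.95)]*1000:.0f} ms")

asyncio.run(main())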
Tuning Recipes by GPU {#tuning-recipes}
RTX 3090 / 3090 Ti (24GB Ampere)
vllm serve casperhansen/llama-3.1-8b-instruct-awq \
--quantization awq \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 8192 \
--max-num-seqs 256
RTX 4090 (24GB Ada)
vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--max-model-len 32768 \
--gpu-memory-utilization 0.93 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 16384
RTX 5090 (32GB Blackwell)
vllm serve casperhansen/llama-3.1-70b-instruct-awq \
--quantization awq \
--kv-cache-dtype fp8_e4m3 \
--max-model-len 32768 \
--gpu-memory-utilization 0.94 \
--enable-prefix-caching \
--enable-chunked-prefill
2x RTX 4090 (48GB total, PCIe)
vllm serve neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching
8x H100 (640GB total)
vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 \
--quantization fp8 \
--tensor-parallel-size 8 \
--max-model-len 65536 \
--gpu-memory-utilization 0.94 \
--enable-prefix-caching \
--enable-chunked-prefill
Common Errors & Fixes {#troubleshooting}
| Error | Cause | Fix |
|---|---|---|
| CUDA out of memory at startup | Model + KV cache exceeds VRAM | Lower --max-model-len, use AWQ/FP8, or reduce --gpu-memory-utilization |
| CUDA out of memory mid-request | Too many concurrent sequences | Lower --max-num-seqs or --max-num-batched-tokens |
| AssertionError: Cannot find model in HF cache | Not authenticated | huggingface-cli login or set HF_TOKEN |
| RuntimeError: NCCL error | TP across GPUs without proper IPC | Add --ipc=host to docker run; mount /dev/shm in K8s |
| Speculative decoding accept rate < 50% | Mismatched draft/target tokenizer | Use same model family for draft and target |
| TTFT > 5s on first request | Cold start, kernel autotune | Warm up with 5-10 dummy requests after start |
| Throughput drops after hours of uptime | KV-cache fragmentation (rare) | Restart; tune --block-size 32 for long-context workloads |
| Pip install fails with torch conflict | System torch version mismatch | Fresh venv; never install over system torch |
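For the cold-start row above, a small warm-up loop run after the server reports healthy is usually enough. A sketch:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
# A handful of throwaway requests to trigger kernel autotuning and graph capture
for i in range(8):
    client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": f"Warm-up request {i}"}],
        max_tokens=16,
    )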
FAQ {#faq}
See answers to common vLLM questions below.
Sources: vLLM documentation | PagedAttention paper (Kwon et al., 2023) | vLLM GitHub | NeuralMagic FP8 model collection | Internal benchmarks on RTX 3090, 4090, 5090, A100, H100.