Aphrodite Engine Setup Guide (2026): vLLM Fork with Every Modern Sampler
Aphrodite Engine is what vLLM looks like when you give it every sampler you've ever wanted, broaden the quantization support to include EXL2 and GGUF, add KoboldAI API compatibility, and let you hot-swap LoRAs per request. Built and maintained by PygmalionAI, it serves the community / roleplay / creative-writing use case that mainline vLLM doesn't prioritize — without giving up vLLM's PagedAttention and continuous batching throughput.
This guide covers everything: installation across pip / Docker / Kubernetes, the OpenAI + KoboldAI dual API, every sampler (with sane presets for chat / code / creative), broader quantization support (EXL2, GGUF on top of AWQ / GPTQ / FP8), multi-LoRA serving, SillyTavern integration, and production tuning.
Table of Contents
- What Aphrodite Engine Is
- Aphrodite vs vLLM Decision Matrix
- Hardware Requirements
- Installation: pip, Docker, Kubernetes
- Your First Model
- Quantization Support: AWQ, GPTQ, FP8, EXL2, GGUF
- The Full Sampler Stack
- OpenAI + KoboldAI Dual API
- Multi-LoRA at Runtime
- Tensor Parallel & Multi-GPU
- Long Context, Prefix Caching, Chunked Prefill
- Vision Models
- SillyTavern Integration
- Tuning Recipes
- Authentication and Rate Limiting
- Observability
- Common Errors
What Aphrodite Engine Is {#what-it-is}
Aphrodite Engine is a fork of vLLM with:
- PagedAttention + continuous batching (inherited from vLLM)
- Every modern sampler — temperature, top-k/p, min-p, typical-p, smoothing, dynamic temperature, mirostat (v1/v2), DRY, XTC, repetition / presence / frequency penalties, skew sampling, no-repeat ngram
- Broader quantization — AWQ, GPTQ, FP8, INT8, plus EXL2 and GGUF
- KoboldAI-compatible API alongside OpenAI
- Multi-LoRA at runtime with per-request adapter selection
- Logits / logprobs API for advanced clients
- Speculative decoding (vanilla draft + EAGLE)
Project: github.com/PygmalionAI/aphrodite-engine. AGPL-3.0 license. Maintained primarily by alpindale and contributors.
Aphrodite vs vLLM Decision Matrix {#vs-vllm}
| Need | Choose |
|---|---|
| High-throughput production with strict SLAs | vLLM |
| Newest models the day they release | vLLM |
| Roleplay / creative writing service | Aphrodite |
| Need DRY / XTC / mirostat samplers | Aphrodite |
| Need EXL2 or GGUF support | Aphrodite |
| KoboldAI-compatible API for SillyTavern users | Aphrodite |
| Multi-LoRA per-request | Both work |
| Lowest single-stream latency | TensorRT-LLM |
| Single-GPU INT4 max speed | ExLlamaV2 / TabbyAPI |
Aphrodite tracks vLLM upstream closely (~weeks behind). For most non-bleeding-edge use cases either works.
Hardware Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA, compute capability 7.0+ (Volta or newer) | RTX 30/40/50, A100, H100 |
| VRAM (8B class) | 16 GB BF16 / 8 GB AWQ | 16 GB+ |
| VRAM (70B class) | 48 GB BF16 / 24 GB AWQ | 48 GB+ |
| Driver | 535+ | 555+ |
| CUDA | 12.1+ | 12.4+ |
| Python | 3.9 | 3.10-3.12 |
| OS | Linux (Ubuntu 22.04+) | Ubuntu 22.04 LTS |
AMD GPUs are supported via the vLLM ROCm path. Apple Silicon, WSL2, and native Windows are not supported.
Installation: pip, Docker, Kubernetes {#installation}
pip
python3.11 -m venv ~/venvs/aphrodite
source ~/venvs/aphrodite/bin/activate
pip install aphrodite-engine
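A quick sanity check before starting a server helps catch a broken CUDA install early (this assumes the package exposes an aphrodite module with a __version__ attribute; adjust if your build differs):
# Verify the package imports and PyTorch sees the GPU
python -c "import aphrodite; print(aphrodite.__version__)"
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"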
Docker
docker run -d --gpus all --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 2242:2242 \
alpindale/aphrodite-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 16384
Default port 2242 (KoboldAI legacy) — also exposes OpenAI on /v1.
Kubernetes
Use a vanilla Deployment with the alpindale/aphrodite-openai image; same pattern as the vLLM Kubernetes section. Don't forget a large /dev/shm: inside a pod, the equivalent of --ipc=host is a Memory-backed emptyDir mounted at /dev/shm (or hostIPC: true).
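A minimal Deployment sketch using the image and flags from the Docker example above; names, resource sizes, and the cache volume are illustrative, so adjust for your cluster:
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aphrodite
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aphrodite
  template:
    metadata:
      labels:
        app: aphrodite
    spec:
      containers:
        - name: aphrodite
          image: alpindale/aphrodite-openai:latest
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--max-model-len", "16384"]
          ports:
            - containerPort: 2242
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: shm
              mountPath: /dev/shm        # large shared memory, replaces --ipc=host
            - name: hf-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
        - name: hf-cache
          emptyDir: {}                   # swap for a PVC to persist model downloads
EOF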
Your First Model {#first-model}
aphrodite run meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--enable-chunked-prefill
Test:
# OpenAI API
curl http://localhost:2242/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role":"user","content":"hi"}]}'
# KoboldAI API
curl http://localhost:2242/api/v1/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Hello", "max_length": 100}'
Quantization Support: AWQ, GPTQ, FP8, EXL2, GGUF {#quantization}
# AWQ-INT4
aphrodite run casperhansen/llama-3.1-8b-instruct-awq --quantization awq
# FP8 (Ada / Hopper / Blackwell)
aphrodite run neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--quantization fp8 --kv-cache-dtype fp8_e4m3
# GPTQ
aphrodite run TheBloke/Llama-2-13B-chat-GPTQ --quantization gptq
# EXL2 (single-GPU only)
aphrodite run turboderp/Llama-3.1-70B-Instruct-exl2 \
--quantization exl2 --revision 4_0bpw
# GGUF (llama.cpp format)
aphrodite run --model bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
--gguf-file Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--quantization gguf
EXL2 and GGUF are the two formats vLLM doesn't handle well (GGUF support upstream is experimental; EXL2 isn't supported at all). For background on the formats see AWQ vs GPTQ vs GGUF.
The Full Sampler Stack {#samplers}
Per-request samplers (extension fields beyond OpenAI):
{
"model": "...",
"messages": [...],
"temperature": 0.7,
"top_p": 0.9,
"top_k": -1,
"min_p": 0.05,
"typical_p": 1.0,
"smoothing_factor": 0.0,
"smoothing_curve": 1.0,
"dynatemp_min": 0.0,
"dynatemp_max": 0.0,
"dynatemp_exponent": 1.0,
"mirostat_mode": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.1,
"repetition_penalty": 1.05,
"no_repeat_ngram_size": 0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"skew": 0.0,
"dry_multiplier": 0.8,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_penalty_last_n": 0,
"dry_sequence_breakers": ["\n", ":", "\""],
"xtc_threshold": 0.1,
"xtc_probability": 0.5,
"guided_json": {...},
"guided_regex": "...",
"guided_grammar": "..."
}
Refer to LLM Sampling Parameters Explained for what each does and recommended presets.
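As a concrete starting point, here is one creative-writing preset wired into a request. The field names come from the list above; the values are common community defaults rather than official recommendations, so tune them per model:
# Creative preset: min-p + DRY + XTC with a light repetition penalty
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role":"user","content":"Continue the story from where we left off."}],
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": -1,
    "min_p": 0.05,
    "repetition_penalty": 1.05,
    "dry_multiplier": 0.8,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "xtc_threshold": 0.1,
    "xtc_probability": 0.5
  }'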
OpenAI + KoboldAI Dual API {#dual-api}
| Endpoint | API |
|---|---|
| /v1/chat/completions | OpenAI |
| /v1/completions | OpenAI |
| /v1/embeddings | OpenAI |
| /api/v1/generate | KoboldAI |
| /api/extra/generate/stream | KoboldAI streaming |
| /api/v1/info | KoboldAI metadata |
KoboldAI clients (older SillyTavern modes, KoboldAI Lite, KoboldClient) hit the /api/v1/... endpoints. OpenAI clients hit /v1/.... Both work simultaneously on the same port.
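For streaming on the KoboldAI side, a sketch assuming /api/extra/generate/stream accepts the same payload as /api/v1/generate and returns server-sent events:
# KoboldAI-style streaming (SSE); -N disables curl buffering
curl -N http://localhost:2242/api/extra/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "max_length": 120}'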
Multi-LoRA at Runtime {#multi-lora}
Start with multiple LoRAs:
aphrodite run meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--max-loras 8 \
--max-lora-rank 64 \
--lora-modules \
story-lora=./loras/story \
code-lora=./loras/code \
legal-lora=./loras/legal
Per-request LoRA via the API (the extra_body wrapper is how the OpenAI Python SDK attaches non-standard fields to the request):
{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [...],
"extra_body": {"lora_request": {"name": "story-lora"}}
}
Per-request adapter swap takes microseconds — useful for serving many fine-tunes from one base engine.
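With raw HTTP, a hedged alternative is to select the adapter by its registered name in the model field, assuming Aphrodite keeps vLLM's convention for adapters registered via --lora-modules:
# Select the story-lora adapter by name (vLLM-style convention, assumed to carry over)
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "story-lora", "messages": [{"role":"user","content":"Tell me a short story"}]}'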
Tensor Parallel & Multi-GPU {#tensor-parallel}
# 2x RTX 4090
aphrodite run casperhansen/llama-3.1-70b-instruct-awq \
--quantization awq \
--tensor-parallel-size 2 \
--max-model-len 16384
For 405B-class models on 8x H100, see the vLLM tensor parallel section; the same flags work in Aphrodite.
Long Context, Prefix Caching, Chunked Prefill {#long-context}
aphrodite run meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 131072 \
--enable-prefix-caching \
--enable-chunked-prefill \
--kv-cache-dtype fp8_e4m3 \
--gpu-memory-utilization 0.94
Prefix caching is mandatory for agent / RAG workloads with repeated system prompts — same gains as in vLLM (10-100x lower TTFT for cached prefixes).
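A quick way to see the effect: send the same request twice and compare wall-clock time; the second call should skip the prefill for the shared prefix. The system prompt below is a placeholder and needs to be long for the difference to show; timings are illustrative, not a benchmark:
# Reuse a long system prompt; the second call hits the prefix cache
PAYLOAD='{"model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [
    {"role": "system", "content": "<paste a multi-thousand-token system prompt here>"},
    {"role": "user", "content": "Summarize the instructions above."}
  ], "max_tokens": 50}'
time curl -s http://localhost:2242/v1/chat/completions -H "Content-Type: application/json" -d "$PAYLOAD" > /dev/null
time curl -s http://localhost:2242/v1/chat/completions -H "Content-Type: application/json" -d "$PAYLOAD" > /dev/null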
Vision Models {#vision}
aphrodite run llava-hf/llava-1.5-7b-hf
Compatible: LLaVA-1.5/1.6, Llama 3.2 11B / 90B Vision, Qwen2-VL, MiniCPM-V, Pixtral. Pass images as image_url content parts in the OpenAI message format (example below).
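A sketch against the LLaVA model started above, using the standard OpenAI content-parts format; the image URL is a placeholder:
# Vision request: mix text and image_url content parts
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}}
      ]
    }],
    "max_tokens": 100
  }'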
SillyTavern Integration {#sillytavern}
In SillyTavern: API → "Text Completion" → API type "Aphrodite" → URL http://<host>:2242. Or "OpenAI" → http://<host>:2242/v1. The Aphrodite mode exposes more samplers; OpenAI is more portable.
For full SillyTavern setup see KoboldCpp Setup Guide — same client config, just different backend.
Tuning Recipes {#tuning}
Multi-user roleplay server (RTX 4090)
aphrodite run casperhansen/llama-3.1-8b-instruct-awq \
--quantization awq \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-seqs 64 \
--max-num-batched-tokens 8192 \
--kv-cache-dtype fp8_e4m3
High-throughput multi-GPU 70B
aphrodite run neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--enable-prefix-caching \
--enable-chunked-prefill
EXL2 single-GPU creative server
aphrodite run turboderp/Llama-3.1-70B-Instruct-exl2 \
--quantization exl2 --revision 4_0bpw \
--max-model-len 16384 \
--gpu-memory-utilization 0.93 \
--enable-prefix-caching
Authentication and Rate Limiting {#auth}
Aphrodite has basic API key support:
aphrodite run <model> --api-keys sk-key1 sk-key2 sk-key3
For multi-tenant rate limits, put it behind LiteLLM or another gateway — same pattern as vLLM auth.
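Clients then pass one of the configured keys as a standard OpenAI-style Bearer token (assuming Aphrodite checks the Authorization header the same way vLLM does):
# Authenticated request using one of the configured keys
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-key1" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role":"user","content":"hi"}]}'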
Observability {#observability}
/metrics exposes Prometheus metrics inherited from vLLM plus Aphrodite-specific ones (per-LoRA request counts, sampler usage). Pair with the vLLM Grafana dashboard (id 19655) — most metric names match.
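To confirm the endpoint is exporting what you expect before wiring up a scrape job, spot-check it directly; metric name prefixes may be aphrodite: or vllm: depending on the version, so grep for both:
# Spot-check the metrics endpoint
curl -s http://localhost:2242/metrics | grep -Ei '^(aphrodite|vllm)' | head -20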
Common Errors {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| EXL2 quantization fails to load | EXL2 single-GPU only | Don't combine with --tensor-parallel-size > 1 |
| GGUF perf lower than llama.cpp | Aphrodite's GGUF kernel | Trade-off: get continuous batching at slight per-token cost |
| Multi-LoRA OOM | Too many adapters loaded | Lower --max-loras |
| Sampler setting ignored | Client only sends the OpenAI-standard fields, or hits the wrong endpoint | Pass Aphrodite's extension fields in the request body |
| Latest model not supported | Architecture not in Aphrodite yet | Use vLLM upstream until merged |
Sources: Aphrodite Engine GitHub | PygmalionAI | vLLM upstream | Internal benchmarks RTX 4090 / 5090 / H100.