Aphrodite Engine Setup Guide (2026): vLLM Fork with Every Modern Sampler
Aphrodite Engine is what vLLM looks like when you give it every sampler you've ever wanted, broaden the quantization support to include EXL2 and GGUF, add KoboldAI API compatibility, and let you hot-swap LoRAs per request. Built and maintained by PygmalionAI, it serves the community / roleplay / creative-writing use case that mainline vLLM doesn't prioritize — without giving up vLLM's PagedAttention and continuous batching throughput.
This guide covers everything: installation across pip / Docker / Kubernetes, the OpenAI + KoboldAI dual API, every sampler (with sane presets for chat / code / creative), broader quantization support (EXL2, GGUF on top of AWQ / GPTQ / FP8), multi-LoRA serving, SillyTavern integration, and production tuning.
Table of Contents
- What Aphrodite Engine Is
- Aphrodite vs vLLM Decision Matrix
- Hardware Requirements
- Installation: pip, Docker, Kubernetes
- Your First Model
- Quantization Support: AWQ, GPTQ, FP8, EXL2, GGUF
- The Full Sampler Stack
- OpenAI + KoboldAI Dual API
- Multi-LoRA at Runtime
- Tensor Parallel & Multi-GPU
- Long Context, Prefix Caching, Chunked Prefill
- Vision Models
- SillyTavern Integration
- Tuning Recipes
- Authentication and Rate Limiting
- Observability
- Common Errors
What Aphrodite Engine Is {#what-it-is}
Aphrodite Engine is a fork of vLLM with:
- PagedAttention + continuous batching (inherited from vLLM)
- Every modern sampler — temperature, top-k/p, min-p, typical-p, smoothing, dynamic temperature, mirostat (v1/v2), DRY, XTC, repetition / presence / frequency penalties, skew sampling, no-repeat ngram
- Broader quantization — AWQ, GPTQ, FP8, INT8, plus EXL2 and GGUF
- KoboldAI-compatible API alongside OpenAI
- Multi-LoRA at runtime with per-request adapter selection
- Logits / logprobs API for advanced clients
- Speculative decoding (vanilla draft + EAGLE)
Project: github.com/PygmalionAI/aphrodite-engine. AGPL-3.0 license. Maintained primarily by alpindale and contributors.
Aphrodite vs vLLM Decision Matrix {#vs-vllm}
| Need | Choose |
|---|---|
| High-throughput production with strict SLAs | vLLM |
| Newest models the day they release | vLLM |
| Roleplay / creative writing service | Aphrodite |
| Need DRY / XTC / mirostat samplers | Aphrodite |
| Need EXL2 or GGUF support | Aphrodite |
| KoboldAI-compatible API for SillyTavern users | Aphrodite |
| Multi-LoRA per-request | Both work |
| Lowest single-stream latency | TensorRT-LLM |
| Single-GPU INT4 max speed | ExLlamaV2 / TabbyAPI |
Aphrodite tracks vLLM upstream closely (~weeks behind). For most non-bleeding-edge use cases either works.
Hardware Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA, compute capability 7.0+ (Volta or newer) | RTX 30/40/50, A100, H100 |
| VRAM (8B class) | 16 GB BF16 / 8 GB AWQ | 16 GB+ |
| VRAM (70B class) | 48 GB BF16 / 24 GB AWQ | 48 GB+ |
| Driver | 535+ | 555+ |
| CUDA | 12.1+ | 12.4+ |
| Python | 3.9 | 3.10-3.12 |
| OS | Linux (Ubuntu 22.04+) | Ubuntu 22.04 LTS |
AMD GPUs are supported via the vLLM ROCm path. Apple Silicon, WSL2, and native Windows are not supported.
Installation: pip, Docker, Kubernetes {#installation}
pip
python3.11 -m venv ~/venvs/aphrodite
source ~/venvs/aphrodite/bin/activate
pip install aphrodite-engine
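A quick sanity check before starting a server helps catch a broken CUDA install early (this assumes the package exposes an aphrodite module with a __version__ attribute; adjust if your build differs):
# Verify the package imports and PyTorch sees the GPU
python -c "import aphrodite; print(aphrodite.__version__)"
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"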
Docker
docker run -d --gpus all --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 2242:2242 \
alpindale/aphrodite-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 16384
Default port 2242 (KoboldAI legacy) — also exposes OpenAI on /v1.
Kubernetes
Use a vanilla Deployment with the alpindale/aphrodite-openai image; same pattern as the vLLM Kubernetes section. Don't forget a large /dev/shm: inside a pod, the equivalent of --ipc=host is a Memory-backed emptyDir mounted at /dev/shm (or hostIPC: true).
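A minimal Deployment sketch using the image and flags from the Docker example above; names, resource sizes, and the cache volume are illustrative, so adjust for your cluster:
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aphrodite
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aphrodite
  template:
    metadata:
      labels:
        app: aphrodite
    spec:
      containers:
        - name: aphrodite
          image: alpindale/aphrodite-openai:latest
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--max-model-len", "16384"]
          ports:
            - containerPort: 2242
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: shm
              mountPath: /dev/shm        # large shared memory, replaces --ipc=host
            - name: hf-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
        - name: hf-cache
          emptyDir: {}                   # swap for a PVC to persist model downloads
EOF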
Your First Model {#first-model}
aphrodite run meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--enable-chunked-prefill
Test:
# OpenAI API
curl http://localhost:2242/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role":"user","content":"hi"}]}'
# KoboldAI API
curl http://localhost:2242/api/v1/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Hello", "max_length": 100}'
Quantization Support: AWQ, GPTQ, FP8, EXL2, GGUF {#quantization}
# AWQ-INT4
aphrodite run casperhansen/llama-3.1-8b-instruct-awq --quantization awq
# FP8 (Ada / Hopper / Blackwell)
aphrodite run neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--quantization fp8 --kv-cache-dtype fp8_e4m3
# GPTQ
aphrodite run TheBloke/Llama-2-13B-chat-GPTQ --quantization gptq
# EXL2 (single-GPU only)
aphrodite run turboderp/Llama-3.1-70B-Instruct-exl2 \
--quantization exl2 --revision 4_0bpw
# GGUF (llama.cpp format)
aphrodite run --model bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
--gguf-file Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--quantization gguf
EXL2 and GGUF are the two formats vLLM doesn't handle well (GGUF support upstream is experimental; EXL2 isn't supported at all). For background on the formats see AWQ vs GPTQ vs GGUF.
The Full Sampler Stack {#samplers}
Per-request samplers (extension fields beyond OpenAI):
{
"model": "...",
"messages": [...],
"temperature": 0.7,
"top_p": 0.9,
"top_k": -1,
"min_p": 0.05,
"typical_p": 1.0,
"smoothing_factor": 0.0,
"smoothing_curve": 1.0,
"dynatemp_min": 0.0,
"dynatemp_max": 0.0,
"dynatemp_exponent": 1.0,
"mirostat_mode": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.1,
"repetition_penalty": 1.05,
"no_repeat_ngram_size": 0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"skew": 0.0,
"dry_multiplier": 0.8,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_penalty_last_n": 0,
"dry_sequence_breakers": ["\n", ":", "\""],
"xtc_threshold": 0.1,
"xtc_probability": 0.5,
"guided_json": {...},
"guided_regex": "...",
"guided_grammar": "..."
}
Refer to LLM Sampling Parameters Explained for what each does and recommended presets.
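As a concrete starting point, here is one creative-writing preset wired into a request. The field names come from the list above; the values are common community defaults rather than official recommendations, so tune them per model:
# Creative preset: min-p + DRY + XTC with a light repetition penalty
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role":"user","content":"Continue the story from where we left off."}],
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": -1,
    "min_p": 0.05,
    "repetition_penalty": 1.05,
    "dry_multiplier": 0.8,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "xtc_threshold": 0.1,
    "xtc_probability": 0.5
  }'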
OpenAI + KoboldAI Dual API {#dual-api}
| Endpoint | API |
|---|---|
| /v1/chat/completions | OpenAI |
| /v1/completions | OpenAI |
| /v1/embeddings | OpenAI |
| /api/v1/generate | KoboldAI |
| /api/extra/generate/stream | KoboldAI streaming |
| /api/v1/info | KoboldAI metadata |
KoboldAI clients (older SillyTavern modes, KoboldAI Lite, KoboldClient) hit the /api/v1/... endpoints. OpenAI clients hit /v1/.... Both work simultaneously on the same port.
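For streaming on the KoboldAI side, a sketch assuming /api/extra/generate/stream accepts the same payload as /api/v1/generate and returns server-sent events:
# KoboldAI-style streaming (SSE); -N disables curl buffering
curl -N http://localhost:2242/api/extra/generate/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "max_length": 120}'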
Multi-LoRA at Runtime {#multi-lora}
Start with multiple LoRAs:
aphrodite run meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--max-loras 8 \
--max-lora-rank 64 \
--lora-modules \
story-lora=./loras/story \
code-lora=./loras/code \
legal-lora=./loras/legal
Per-request LoRA via the API (the extra_body wrapper is how the OpenAI Python SDK attaches non-standard fields to the request):
{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [...],
"extra_body": {"lora_request": {"name": "story-lora"}}
}
Per-request adapter swap takes microseconds — useful for serving many fine-tunes from one base engine.
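With raw HTTP, a hedged alternative is to select the adapter by its registered name in the model field, assuming Aphrodite keeps vLLM's convention for adapters registered via --lora-modules:
# Select the story-lora adapter by name (vLLM-style convention, assumed to carry over)
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "story-lora", "messages": [{"role":"user","content":"Tell me a short story"}]}'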
Tensor Parallel & Multi-GPU {#tensor-parallel}
# 2x RTX 4090
aphrodite run casperhansen/llama-3.1-70b-instruct-awq \
--quantization awq \
--tensor-parallel-size 2 \
--max-model-len 16384
For 405B-class models on 8x H100, see the vLLM tensor parallel section; the same flags work in Aphrodite.
Long Context, Prefix Caching, Chunked Prefill {#long-context}
aphrodite run meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 131072 \
--enable-prefix-caching \
--enable-chunked-prefill \
--kv-cache-dtype fp8_e4m3 \
--gpu-memory-utilization 0.94
Prefix caching is mandatory for agent / RAG workloads with repeated system prompts — same gains as in vLLM (10-100x lower TTFT for cached prefixes).
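A quick way to see the effect: send the same request twice and compare wall-clock time; the second call should skip the prefill for the shared prefix. The system prompt below is a placeholder and needs to be long for the difference to show; timings are illustrative, not a benchmark:
# Reuse a long system prompt; the second call hits the prefix cache
PAYLOAD='{"model": "meta-llama/Llama-3.1-8B-Instruct",
  "messages": [
    {"role": "system", "content": "<paste a multi-thousand-token system prompt here>"},
    {"role": "user", "content": "Summarize the instructions above."}
  ], "max_tokens": 50}'
time curl -s http://localhost:2242/v1/chat/completions -H "Content-Type: application/json" -d "$PAYLOAD" > /dev/null
time curl -s http://localhost:2242/v1/chat/completions -H "Content-Type: application/json" -d "$PAYLOAD" > /dev/null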
Vision Models {#vision}
aphrodite run llava-hf/llava-1.5-7b-hf
Compatible: LLaVA-1.5/1.6, Llama 3.2 11B / 90B Vision, Qwen2-VL, MiniCPM-V, Pixtral. Pass images as image_url content parts in the OpenAI message format (example below).
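A sketch against the LLaVA model started above, using the standard OpenAI content-parts format; the image URL is a placeholder:
# Vision request: mix text and image_url content parts
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}}
      ]
    }],
    "max_tokens": 100
  }'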
SillyTavern Integration {#sillytavern}
In SillyTavern: API → "Text Completion" → API type "Aphrodite" → URL http://<host>:2242. Or "OpenAI" → http://<host>:2242/v1. The Aphrodite mode exposes more samplers; OpenAI is more portable.
For full SillyTavern setup see KoboldCpp Setup Guide — same client config, just different backend.
Tuning Recipes {#tuning}
Multi-user roleplay server (RTX 4090)
aphrodite run casperhansen/llama-3.1-8b-instruct-awq \
--quantization awq \
--max-model-len 32768 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-seqs 64 \
--max-num-batched-tokens 8192 \
--kv-cache-dtype fp8_e4m3
High-throughput multi-GPU 70B
aphrodite run neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--enable-prefix-caching \
--enable-chunked-prefill
EXL2 single-GPU creative server
aphrodite run turboderp/Llama-3.1-70B-Instruct-exl2 \
--quantization exl2 --revision 4_0bpw \
--max-model-len 16384 \
--gpu-memory-utilization 0.93 \
--enable-prefix-caching
Authentication and Rate Limiting {#auth}
Aphrodite has basic API key support:
aphrodite run <model> --api-keys sk-key1 sk-key2 sk-key3
For multi-tenant rate limits, put it behind LiteLLM or another gateway — same pattern as vLLM auth.
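Clients then pass one of the configured keys as a standard OpenAI-style Bearer token (assuming Aphrodite checks the Authorization header the same way vLLM does):
# Authenticated request using one of the configured keys
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-key1" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role":"user","content":"hi"}]}'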
Observability {#observability}
/metrics exposes Prometheus metrics inherited from vLLM plus Aphrodite-specific ones (per-LoRA request counts, sampler usage). Pair with the vLLM Grafana dashboard (id 19655) — most metric names match.
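To confirm the endpoint is exporting what you expect before wiring up a scrape job, spot-check it directly; metric name prefixes may be aphrodite: or vllm: depending on the version, so grep for both:
# Spot-check the metrics endpoint
curl -s http://localhost:2242/metrics | grep -Ei '^(aphrodite|vllm)' | head -20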
Common Errors {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| EXL2 quantization fails to load | EXL2 single-GPU only | Don't combine with --tensor-parallel-size > 1 |
| GGUF perf lower than llama.cpp | Aphrodite's GGUF kernel | Trade-off: get continuous batching at slight per-token cost |
| Multi-LoRA OOM | Too many adapters loaded | Lower --max-loras |
| Sampler setting ignored | Client only sends the OpenAI-standard fields, or hits the wrong endpoint | Pass Aphrodite's extension fields in the request body |
| Latest model not supported | Architecture not in Aphrodite yet | Use vLLM upstream until merged |
Sources: Aphrodite Engine GitHub | PygmalionAI | vLLM upstream | Internal benchmarks RTX 4090 / 5090 / H100.