Aphrodite Engine Setup Guide (2026): vLLM Fork with Every Modern Sampler

May 1, 2026
22 min read
LocalAimaster Research Team


Aphrodite Engine is what vLLM looks like when you give it every sampler you've ever wanted, broaden the quantization support to include EXL2 and GGUF, add KoboldAI API compatibility, and let you hot-swap LoRAs per request. Built and maintained by PygmalionAI, it serves the community / roleplay / creative-writing use case that mainline vLLM doesn't prioritize — without giving up vLLM's PagedAttention and continuous batching throughput.

This guide covers everything: installation across pip / Docker / Kubernetes, the OpenAI + KoboldAI dual API, every sampler (with sane presets for chat / code / creative), broader quantization support (EXL2, GGUF on top of AWQ / GPTQ / FP8), multi-LoRA serving, SillyTavern integration, and production tuning.

Table of Contents

  1. What Aphrodite Engine Is
  2. Aphrodite vs vLLM Decision Matrix
  3. Hardware Requirements
  4. Installation: pip, Docker, Kubernetes
  5. Your First Model
  6. Quantization Support: AWQ, GPTQ, FP8, EXL2, GGUF
  7. The Full Sampler Stack
  8. OpenAI + KoboldAI Dual API
  9. Multi-LoRA at Runtime
  10. Tensor Parallel & Multi-GPU
  11. Long Context, Prefix Caching, Chunked Prefill
  12. Vision Models
  13. SillyTavern Integration
  14. Tuning Recipes
  15. Authentication and Rate Limiting
  16. Observability
  17. Common Errors


What Aphrodite Engine Is {#what-it-is}

Aphrodite Engine is a fork of vLLM with:

  • PagedAttention + continuous batching (inherited from vLLM)
  • Every modern sampler — temperature, top-k/p, min-p, typical-p, smoothing, dynamic temperature, mirostat (v1/v2), DRY, XTC, repetition / presence / frequency penalties, skew sampling, no-repeat ngram
  • Broader quantization — AWQ, GPTQ, FP8, INT8, plus EXL2 and GGUF
  • KoboldAI-compatible API alongside OpenAI
  • Multi-LoRA at runtime with per-request adapter selection
  • Logits / logprobs API for advanced clients
  • Speculative decoding (vanilla draft + EAGLE)
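A sketch of that last item, assuming Aphrodite keeps vLLM's classic flags for draft-model speculation (--speculative-model and --num-speculative-tokens; confirm with aphrodite run --help on your build, since these flags have churned upstream):

aphrodite run meta-llama/Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.1-8B-Instruct \
    --num-speculative-tokens 5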

Project: github.com/PygmalionAI/aphrodite-engine. AGPL-3.0 license. Maintained primarily by alpindale and contributors.


Aphrodite vs vLLM Decision Matrix {#vs-vllm}

| Need | Choose |
| --- | --- |
| High-throughput production with strict SLAs | vLLM |
| Newest models the day they release | vLLM |
| Roleplay / creative writing service | Aphrodite |
| DRY / XTC / mirostat samplers | Aphrodite |
| EXL2 or GGUF support | Aphrodite |
| KoboldAI-compatible API for SillyTavern users | Aphrodite |
| Multi-LoRA per request | Both work |
| Lowest single-stream latency | TensorRT-LLM |
| Single-GPU INT4 max speed | ExLlamaV2 / TabbyAPI |

Aphrodite tracks vLLM upstream closely, typically landing changes a few weeks behind. For most non-bleeding-edge use cases, either works.


Hardware Requirements {#requirements}

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | NVIDIA, compute capability 7.0+ | RTX 30/40/50 series, A100, H100 |
| VRAM (8B class) | 16 GB (BF16) / 8 GB (AWQ INT4) | 16 GB+ |
| VRAM (70B class) | ~140 GB (BF16) / ~40 GB (AWQ INT4) | 48 GB+ total for INT4 (e.g. 2x 24 GB) |
| Driver | 535+ | 555+ |
| CUDA | 12.1+ | 12.4+ |
| Python | 3.9 | 3.10-3.12 |
| OS | Linux (Ubuntu 22.04+) | Ubuntu 22.04 LTS |

AMD GPUs are supported via the ROCm path inherited from vLLM. Apple Silicon, WSL2, and native Windows are not.



Installation: pip, Docker, Kubernetes {#installation}

pip

python3.11 -m venv ~/venvs/aphrodite
source ~/venvs/aphrodite/bin/activate

pip install aphrodite-engine
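Sanity-check the install before moving on (assuming the package exposes __version__, which most builds do; the aphrodite CLI entry point is the same one used throughout this guide):

python -c "import aphrodite; print(aphrodite.__version__)"
aphrodite run --help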

Docker

docker run -d --gpus all --ipc=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 2242:2242 \
    alpindale/aphrodite-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 16384

Default port 2242 (KoboldAI legacy) — also exposes OpenAI on /v1.
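Once the container is up, verify with the OpenAI-compatible model listing:

curl http://localhost:2242/v1/models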

Kubernetes

Use a vanilla Deployment with the alpindale/aphrodite-openai image; same pattern as the vLLM Kubernetes section. Don't forget /dev/shm and --ipc=host.
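A minimal sketch of that Deployment. Names, cache volume, and sizes are placeholders to adapt (in production the Hugging Face cache should be a PersistentVolumeClaim rather than an emptyDir). Kubernetes has no --ipc=host flag, so the usual stand-in is a memory-backed /dev/shm volume:

cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aphrodite
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aphrodite
  template:
    metadata:
      labels:
        app: aphrodite
    spec:
      containers:
        - name: aphrodite
          image: alpindale/aphrodite-openai:latest
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--max-model-len", "16384"]
          ports:
            - containerPort: 2242
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            # Memory-backed /dev/shm, the Kubernetes equivalent of --ipc=host here
            - name: shm
              mountPath: /dev/shm
            - name: hf-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
        - name: hf-cache
          emptyDir: {}
EOF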


Your First Model {#first-model}

aphrodite run meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --enable-chunked-prefill

Test:

# OpenAI API
curl http://localhost:2242/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role":"user","content":"hi"}]}'

# KoboldAI API
curl http://localhost:2242/api/v1/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Hello", "max_length": 100}'

Quantization Support: AWQ, GPTQ, FP8, EXL2, GGUF {#quantization}

# AWQ-INT4
aphrodite run casperhansen/llama-3.1-8b-instruct-awq --quantization awq

# FP8 (Ada / Hopper / Blackwell)
aphrodite run neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --quantization fp8 --kv-cache-dtype fp8_e4m3

# GPTQ
aphrodite run TheBloke/Llama-2-13B-chat-GPTQ --quantization gptq

# EXL2 (single-GPU only)
aphrodite run turboderp/Llama-3.1-70B-Instruct-exl2 \
    --quantization exl2 --revision 4_0bpw

# GGUF (llama.cpp format)
aphrodite run --model bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
    --gguf-file Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --quantization gguf

EXL2 and GGUF are the two formats mainline vLLM either lacks or supports only experimentally. For background on the formats, see AWQ vs GPTQ vs GGUF.
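A quick sizing rule of thumb: weight memory is roughly parameter count × bytes per weight, before KV cache and runtime overhead. An 8B model is ~16 GB at BF16 and ~4-5 GB at 4-bit; a 70B model is ~140 GB at BF16 and ~35-40 GB at 4-bit, which is why every 70B recipe in this guide assumes quantization, multiple GPUs, or both.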


The Full Sampler Stack {#samplers}

Per-request samplers (extension fields beyond OpenAI):

{
  "model": "...",
  "messages": [...],
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": -1,
  "min_p": 0.05,
  "typical_p": 1.0,
  "smoothing_factor": 0.0,
  "smoothing_curve": 1.0,
  "dynatemp_min": 0.0,
  "dynatemp_max": 0.0,
  "dynatemp_exponent": 1.0,
  "mirostat_mode": 0,
  "mirostat_tau": 5.0,
  "mirostat_eta": 0.1,
  "repetition_penalty": 1.05,
  "no_repeat_ngram_size": 0,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "skew": 0.0,
  "dry_multiplier": 0.8,
  "dry_base": 1.75,
  "dry_allowed_length": 2,
  "dry_penalty_last_n": 0,
  "dry_sequence_breakers": ["\n", ":", "\""],
  "xtc_threshold": 0.1,
  "xtc_probability": 0.5,
  "guided_json": {...},
  "guided_regex": "...",
  "guided_grammar": "..."
}

Refer to LLM Sampling Parameters Explained for what each does and recommended presets.
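As a concrete starting point, here is an illustrative creative-writing preset: min-p trims the low-probability tail, DRY suppresses verbatim loops, and XTC skims the most predictable tokens. The values are common community defaults to tune against your model, not canon:

curl http://localhost:2242/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Write the opening line of a noir story."}],
        "temperature": 1.0,
        "min_p": 0.05,
        "repetition_penalty": 1.0,
        "dry_multiplier": 0.8,
        "dry_base": 1.75,
        "dry_allowed_length": 2,
        "xtc_threshold": 0.1,
        "xtc_probability": 0.5
    }'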


OpenAI + KoboldAI Dual API {#dual-api}

| Endpoint | API |
| --- | --- |
| /v1/chat/completions | OpenAI |
| /v1/completions | OpenAI |
| /v1/embeddings | OpenAI |
| /api/v1/generate | KoboldAI |
| /api/extra/generate/stream | KoboldAI (streaming) |
| /api/v1/info | KoboldAI (metadata) |

KoboldAI clients (older SillyTavern modes, KoboldAI Lite, KoboldClient) hit the /api/v1/... endpoints. OpenAI clients hit /v1/.... Both work simultaneously on the same port.
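The streaming endpoint accepts the same payload as /api/v1/generate and emits server-sent events, following the KoboldAI "extra" API convention (worth verifying against your build):

curl http://localhost:2242/api/extra/generate/stream \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Hello", "max_length": 100}'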


Multi-LoRA at Runtime {#multi-lora}

Start with multiple LoRAs:

aphrodite run meta-llama/Llama-3.1-8B-Instruct \
    --enable-lora \
    --max-loras 8 \
    --max-lora-rank 64 \
    --lora-modules \
        story-lora=./loras/story \
        code-lora=./loras/code \
        legal-lora=./loras/legal

Per-request LoRA via the API: each adapter registered with --lora-modules is served under its own model name (the vLLM convention Aphrodite inherits), so you select one by naming it in the model field:

{
    "model": "story-lora",
    "messages": [...]
}

Per-request adapter swaps are cheap because loaded adapters stay resident on the GPU, which makes it practical to serve many fine-tunes from one base engine.
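To confirm the adapters registered, list the served models; each LoRA name should appear alongside the base model ID:

curl http://localhost:2242/v1/models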


Tensor Parallel & Multi-GPU {#tensor-parallel}

# 2x RTX 4090
aphrodite run casperhansen/llama-3.1-70b-instruct-awq \
    --quantization awq \
    --tensor-parallel-size 2 \
    --max-model-len 16384

For 8x H100 405B, see the vLLM tensor parallel section — same flags work in Aphrodite.


Long Context, Prefix Caching, Chunked Prefill {#long-context}

aphrodite run meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 131072 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --kv-cache-dtype fp8_e4m3 \
    --gpu-memory-utilization 0.94

Prefix caching is mandatory for agent / RAG workloads with repeated system prompts — same gains as in vLLM (10-100x lower TTFT for cached prefixes).
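To see why the fp8 KV cache matters at this length: Llama-3.1-8B has 32 layers, 8 KV heads, and a head dimension of 128, so each token stores 2 × 32 × 8 × 128 = 65,536 cached values. At one byte per value (fp8_e4m3), a full 131,072-token sequence needs roughly 8 GB of KV cache; at fp16 it would be roughly 16 GB, on top of the ~16 GB of BF16 weights.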


Vision Models {#vision}

aphrodite run llava-hf/llava-1.5-7b-hf

Compatible: LLaVA-1.5/1.6, Llama 3.2 11B / 90B Vision, Qwen2-VL, MiniCPM-V, Pixtral. Pass image_url content parts in the OpenAI message format, as shown below.
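A minimal request (the image URL is a placeholder):

curl http://localhost:2242/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llava-hf/llava-1.5-7b-hf",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}
            ]
        }]
    }'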


SillyTavern Integration {#sillytavern}

In SillyTavern: API → "Text Completion" → API type "Aphrodite" → URL http://<host>:2242. Or "OpenAI" → http://<host>:2242/v1. The Aphrodite mode exposes more samplers; OpenAI is more portable.

For full SillyTavern setup see KoboldCpp Setup Guide — same client config, just different backend.


Tuning Recipes {#tuning}

Multi-user roleplay server (RTX 4090)

aphrodite run casperhansen/llama-3.1-8b-instruct-awq \
    --quantization awq \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-seqs 64 \
    --max-num-batched-tokens 8192 \
    --kv-cache-dtype fp8_e4m3

High-throughput multi-GPU 70B

aphrodite run neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --enable-chunked-prefill

EXL2 single-GPU creative server

aphrodite run turboderp/Llama-3.1-70B-Instruct-exl2 \
    --quantization exl2 --revision 4_0bpw \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.93 \
    --enable-prefix-caching

Authentication and Rate Limiting {#auth}

Aphrodite has basic API key support:

aphrodite run <model> --api-keys sk-key1 sk-key2 sk-key3
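Clients then authenticate with the standard OpenAI-style bearer header:

curl http://localhost:2242/v1/chat/completions \
    -H "Authorization: Bearer sk-key1" \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "hi"}]}'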

For multi-tenant rate limits, put it behind LiteLLM or another gateway — same pattern as vLLM auth.


Observability {#observability}

/metrics exposes Prometheus metrics inherited from vLLM plus Aphrodite-specific ones (per-LoRA request counts, sampler usage). Pair with the vLLM Grafana dashboard (id 19655) — most metric names match.
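A quick spot check; the metric prefix has been vllm: historically and aphrodite: in some builds, so grep loosely:

curl -s http://localhost:2242/metrics | grep -E "num_requests_(running|waiting)"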


Common Errors {#troubleshooting}

| Symptom | Cause | Fix |
| --- | --- | --- |
| EXL2 quantization fails to load | EXL2 is single-GPU only | Don't combine with --tensor-parallel-size > 1 |
| GGUF perf lower than llama.cpp | Aphrodite's GGUF kernels | Trade-off: continuous batching at a slight per-token cost |
| Multi-LoRA OOM | Too many adapters loaded | Lower --max-loras |
| Sampler ignored | Wrong API endpoint | Use Aphrodite's extension fields, not the OpenAI subset |
| Latest model not supported | Architecture not in Aphrodite yet | Use vLLM upstream until merged |


Sources: Aphrodite Engine GitHub | PygmalionAI | vLLM upstream | Internal benchmarks RTX 4090 / 5090 / H100.
