ExLlamaV2 + TabbyAPI Guide (2026): Best INT4 Inference on a Single GPU
ExLlamaV2 is the unsung hero of consumer-GPU local LLM inference. On a single RTX 3090 / 4090 / 5090, no other framework matches its speed for INT4-class quantized models — not vLLM, not llama.cpp, not Ollama. The catch: it is NVIDIA-only, built around single-user (batch-size-1) serving, limited to simple layer splits when you do add GPUs, and barely documented for newcomers.
TabbyAPI puts an OpenAI-compatible HTTP server on top, making ExLlamaV2 production-usable. Together they are the right answer when you have one good NVIDIA GPU, run one user at a time, and want maximum tokens per second.
This guide covers everything: how EXL2 quantization works and why it beats AWQ / GGUF on quality-per-bit, installing TabbyAPI on Linux / Windows / Docker, picking the right bits-per-weight for your VRAM, long-context tuning with cache quantization, sampling presets, multi-GPU layer splits, and benchmarks against vLLM, llama.cpp, and TensorRT-LLM on the same hardware.
Table of Contents
- What ExLlamaV2 and TabbyAPI Are
- Why EXL2 Beats Other 4-Bit Formats on Quality
- Hardware & Software Requirements
- Installation: TabbyAPI
- Picking the Right EXL2 bpw for Your VRAM
- Downloading Pre-Quantized Models
- Quantizing Your Own Model to EXL2
- TabbyAPI Configuration
- Cache Quantization (Q4 / Q6 / Q8 / FP16)
- Long Context (32K-131K)
- Multi-GPU Layer Split
- Sampling & Prompt Templates
- LoRA Adapters
- Speculative Decoding
- Tool Calling / Function Calling
- Performance Benchmarks
- Tuning Recipes by GPU
- Common Mistakes & Fixes
- FAQ
What ExLlamaV2 and TabbyAPI Are {#what-it-is}
ExLlamaV2 (turboderp's repo) is a CUDA-first inference library for INT4-quantized transformer models. It pairs hand-tuned dequant-fused matmul kernels with an aggressive batch-size-1 scheduler. The format it loads, EXL2, is its own quantization scheme — measurement-based, mixed-bit-width, optimized for inference on Ampere+ NVIDIA cards.
TabbyAPI (theroyallab/tabbyAPI) is the OpenAI-compatible server wrapping ExLlamaV2. It exposes /v1/chat/completions, /v1/completions, streaming, sampling presets, prompt templates, basic auth, and a YAML config.
Together, they are the right local inference stack when:
- You have one NVIDIA GPU (Ampere or newer).
- You serve one or two concurrent users.
- You want maximum tok/s.
For multi-user concurrent serving, see vLLM; for minimum single-stream latency on H100-class hardware, TensorRT-LLM; for AMD / Apple, llama.cpp / Ollama.
Why EXL2 Beats Other 4-Bit Formats on Quality {#exl2-quality}
Most 4-bit formats (GPTQ, AWQ, GGUF Q4_K_M) use a uniform bit width — every weight matrix gets the same number of bits. EXL2 instead allocates bits per layer based on measurement: it runs a calibration pass, measures how much each layer's weights affect output perplexity, and gives more bits to sensitive layers (attention QKV, output projection) and fewer to less-sensitive ones (later FFN layers).
Result: at the same average bits-per-weight, EXL2 produces lower perplexity than uniform-width quants.
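To make the allocation idea concrete, here is a toy sketch — not ExLlamaV2's actual optimizer, just the core intuition: spend a fixed extra bit budget in proportion to measured layer sensitivity so the average lands on the target bpw.

def allocate_bits(sensitivity, target_bpw, floor=2.0, ceil=8.0):
    # Hand out the extra budget proportionally to sensitivity, so the
    # average stays at target_bpw (ignoring clipping at the ceiling).
    total = sum(sensitivity)
    extra = (target_bpw - floor) * len(sensitivity)
    return [min(ceil, floor + extra * s / total) for s in sensitivity]

# e.g. attention projections measured as more sensitive than late FFN blocks
sens = {"q_proj": 0.9, "k_proj": 0.8, "v_proj": 0.85, "ffn_30": 0.2, "ffn_31": 0.15}
for name, b in zip(sens, allocate_bits(list(sens.values()), target_bpw=4.0)):
    print(f"{name}: {b:.2f} bpw")

Run it and the attention projections land above 4.5 bpw while the late FFN layers drop below 3 bpw, with the average still 4.0 — the same trade EXL2 makes, just with real calibration data instead of made-up scores.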
Llama 3.1 70B perplexity on WikiText (lower is better):
| Format | Avg bits | Size | PPL |
|---|---|---|---|
| FP16 | 16.00 | 140 GB | 4.96 |
| EXL2 6.0bpw | 6.00 | 53 GB | 4.97 |
| EXL2 5.0bpw | 5.00 | 44 GB | 4.99 |
| AWQ-INT4 g128 | 4.25 | 36 GB | 5.01 |
| GGUF Q5_K_M | 5.66 | 50 GB | 5.00 |
| EXL2 4.5bpw | 4.50 | 40 GB | 5.00 |
| EXL2 4.0bpw | 4.00 | 35 GB | 5.05 |
| GGUF Q4_K_M | 4.83 | 42 GB | 5.06 |
| GGUF IQ4_XS | 4.25 | 38 GB | 5.04 |
| GPTQ-128g | 4.25 | 36 GB | 5.10 |
EXL2 4.0bpw is slightly smaller than AWQ-INT4 g128 at comparable perplexity, and unlike the uniform formats it can go lower still: a ~2.4-2.5bpw quant shrinks a 70B model to roughly 21-22 GB of weights, which is what makes a 70B model with 8K context and a Q4-quantized KV cache practical on a single 24 GB RTX 3090 / 4090.
Hardware & Software Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| GPU | Compute capability 8.0+ (Ampere) | RTX 3090 / 4090 / 5090 / A6000 / L40S |
| VRAM (8B class) | 8 GB | 12 GB+ |
| VRAM (32B class) | 16 GB | 20 GB+ |
| VRAM (70B class) | 24 GB (low-bpw, ~2.4bpw) | 48 GB |
| Driver | 535+ | 555+ |
| CUDA | 12.1+ (build only; runtime via PyTorch) | 12.4+ |
| Python | 3.10 | 3.11-3.12 |
| OS | Linux, Windows | Ubuntu 22.04 LTS |
| RAM | 16 GB | 32 GB+ |
| Disk | 100 GB | NVMe |
Turing (RTX 20-series, GTX 16) is not supported — kernels require Ampere SM 8.0 features.
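A quick sanity check you can run from the same Python environment (requires PyTorch with CUDA) to confirm your card meets the SM 8.0 requirement:

import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible — check the driver install.")

major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"{name}: compute capability {major}.{minor}, {vram_gb:.1f} GB VRAM")
if (major, minor) < (8, 0):
    print("Pre-Ampere GPU — ExLlamaV2 kernels will not run here.")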
Installation: TabbyAPI {#installation}
Linux / WSL2
git clone https://github.com/theroyallab/tabbyAPI.git
cd tabbyAPI
python3.11 -m venv venv
source venv/bin/activate
pip install --upgrade pip
./start.sh # auto-detects your CUDA version and installs matching exllamav2 / flash-attn wheels
The start.sh script handles installing exllamav2 + flash-attn + dependencies with the right CUDA pins.
Windows
git clone https://github.com/theroyallab/tabbyAPI.git
cd tabbyAPI
.\start.bat
The start.bat script does the equivalent venv + install.
Docker
docker run -d --name tabbyapi \
--gpus all \
-p 5000:5000 \
-v $(pwd)/models:/app/models \
-v $(pwd)/config.yml:/app/config.yml \
ghcr.io/theroyallab/tabbyapi:latest
Manual install (pip)
pip install exllamav2
pip install flash-attn --no-build-isolation
pip install -r requirements.txt
python main.py --config config.yml
Default port 5000. Browse http://localhost:5000/docs for the FastAPI Swagger UI.
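Once the server is up, a quick sanity check from Python — this assumes the default port 5000 and whatever API key TabbyAPI generated at first start (any string works if you set disable_auth: true). The model field is mostly informational unless inline model loading is enabled.

from openai import OpenAI

# TabbyAPI speaks the OpenAI protocol; just point the client at it.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")

resp = client.chat.completions.create(
    model="local",  # the currently loaded model is served regardless of this name
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)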
Picking the Right EXL2 bpw for Your VRAM {#picking-bpw}
Approximate VRAM budget (GB) ≈ parameters (in billions) × bpw / 8 + KV cache + overhead (~1-2 GB). A small estimator sketch follows the table.
| GPU VRAM | Recommended Model + bpw |
|---|---|
| 8 GB | Llama 3.1 8B EXL2 4.0bpw, 4K ctx, Q4 cache |
| 12 GB | Llama 3.1 8B EXL2 6.0bpw, 16K ctx |
| 16 GB | Qwen 2.5 14B EXL2 5.0bpw, 16K ctx |
| 24 GB | Qwen 2.5 32B EXL2 5.0bpw / Llama 3.1 70B EXL2 2.4bpw with Q4 cache |
| 32 GB (5090) | Llama 3.1 70B EXL2 3.0bpw, 16K ctx, Q4 cache |
| 48 GB (A6000 / 2x 24GB) | Llama 3.1 70B EXL2 4.5bpw, 32K ctx |
| 80 GB (H100 / A100) | Llama 3.1 70B EXL2 8.0bpw (405B needs multiple 80 GB GPUs even at ~2.5bpw) |
For 70B on exactly 24 GB: ~2.4bpw + Q4 cache works at 8K context; 2.5bpw squeezes in at ~4K. 3.0bpw and above will OOM at load.
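A back-of-envelope helper for the weights term of the formula above (add the KV cache figures from the cache quantization section plus ~1-2 GB of overhead); the numbers line up with the size column in the perplexity table:

def weights_gb(params_billions: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GB."""
    return params_billions * bpw / 8

for bpw in (2.4, 3.0, 4.0, 4.5, 5.0, 6.0):
    print(f"70B @ {bpw}bpw ≈ {weights_gb(70.6, bpw):.0f} GB")
# 2.4bpw ≈ 21 GB (24 GB card), 3.0 ≈ 26 GB (32 GB), 4.5 ≈ 40 GB (48 GB), 6.0 ≈ 53 GB (80 GB)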
Downloading Pre-Quantized Models {#downloading}
The community publishes EXL2 quants on Hugging Face. Top providers:
- turboderp — the author; smallest selection but highest-quality reference quants.
- LoneStriker — extensive coverage of popular models.
- bartowski — broad coverage, multiple bpw per model.
# Use huggingface-cli
huggingface-cli download bartowski/Meta-Llama-3.1-70B-Instruct-exl2 \
--revision 4_0 \
--local-dir ./models/Llama-3.1-70B-4.0bpw
EXL2 repos use branch revisions for different bpw — --revision 4_0 selects the 4.0bpw branch. Always check the README for available branches.
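The same download from Python if you prefer scripting it — huggingface_hub's snapshot_download accepts the branch name via revision:

from huggingface_hub import snapshot_download

# Pull the 4.0bpw branch of an EXL2 repo into a local folder
snapshot_download(
    repo_id="bartowski/Meta-Llama-3.1-70B-Instruct-exl2",
    revision="4_0",
    local_dir="./models/Llama-3.1-70B-4.0bpw",
)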
Quantizing Your Own Model to EXL2 {#quantizing}
git clone https://github.com/turboderp-org/exllamav2
cd exllamav2
# Quantize Llama 3.1 8B to 4.65bpw average
python convert.py \
-i ./Llama-3.1-8B-Instruct \
-o ./Llama-3.1-8B-4.65bpw \
-nr \
-b 4.65 \
-hb 6 \
-c ./calibration.parquet
Flags:
- -b — target average bpw (e.g., 4.65, 5.0, 6.5)
- -hb — head bits for the output projection layer (keep at 6+ for quality)
- -c — calibration dataset (wikitext-2.parquet is the default; provide domain-specific data if your use case is narrow)
- -nr — don't resume from an existing measurement file; start fresh
Time: Llama 3.1 8B = 30-60 min on RTX 4090; 70B = 2-4 hours; 405B = 12-24 hours.
For a domain-specific calibration set, format your data as a single text column in a Parquet file, ~1024 rows of ~2K tokens each. Custom calibration typically improves perplexity on in-domain data by 1-3%.
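A minimal way to produce such a file with pandas (pyarrow installed for Parquet support). The corpus directory here is hypothetical — point it at your own text; chunk boundaries don't need to be exact token counts.

import glob
import pandas as pd

# Cut domain text files into ~8K-character chunks (~2K tokens each).
chunks = []
for path in glob.glob("my_domain_corpus/*.txt"):  # hypothetical corpus directory
    text = open(path, encoding="utf-8").read()
    chunks += [text[i:i + 8000] for i in range(0, len(text), 8000)]

# One column named "text", one chunk per row — the layout convert.py's -c flag expects.
pd.DataFrame({"text": chunks[:1024]}).to_parquet("calibration.parquet", index=False)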
TabbyAPI Configuration {#config}
config.yml:
network:
host: 0.0.0.0
port: 5000
disable_auth: false # set true for local dev only
logging:
log_prompt: false
log_generation_params: false
log_requests: false
model:
model_dir: models
model_name: Meta-Llama-3.1-70B-Instruct-exl2-4.0bpw
use_dummy_models: false
inline_model_loading: true
use_as_default: ['max_seq_len', 'cache_mode']
max_seq_len: 16384
override_base_seq_len:
cache_mode: Q4 # Q4, Q6, Q8, FP16
cache_size: 16384
chunk_size: 2048
prompt_template: # auto-detected from tokenizer_config; override here
# Multi-GPU
gpu_split_auto: true
autosplit_reserve: [96] # MB reserved on first GPU
gpu_split: # explicit override e.g. [22, 22]
# Optimizations
fasttensors: true
# Generation defaults
rope_scale: 1.0
rope_alpha: 1.0
draft_model:
draft_model_dir: # speculative decoding draft
draft_rope_scale: 1.0
embeddings:
embedding_model_dir:
embeddings_device: cpu
developer:
unsafe_launch: false
disable_request_streaming: false
cuda_malloc_backend: true
uvloop: true
realtime_process_priority: false
Run:
python main.py --config config.yml
Endpoints (OpenAI-compatible): /v1/chat/completions, /v1/completions. Plus /v1/model/load, /v1/model/unload for hot-swapping models.
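Hot-swapping from a script looks roughly like this. The admin key comes from the token file TabbyAPI generates at first start, and the exact payload fields are easiest to confirm against the Swagger UI at /docs — treat the field names below as a sketch, not gospel.

import requests

BASE = "http://localhost:5000"
HEADERS = {"x-admin-key": "YOUR_ADMIN_KEY"}  # generated by TabbyAPI on first start

# Unload the current model, then load a different folder from your models directory
requests.post(f"{BASE}/v1/model/unload", headers=HEADERS, timeout=60)
requests.post(
    f"{BASE}/v1/model/load",
    headers=HEADERS,
    json={"name": "Qwen2.5-32B-Instruct-exl2-5.0bpw", "max_seq_len": 16384},
    timeout=600,  # large models take minutes to load
)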
Cache Quantization (Q4 / Q6 / Q8 / FP16) {#cache-quantization}
| cache_mode | Bytes per cached K/V element pair | Perplexity loss | Use case |
|---|---|---|---|
| FP16 | 4 | 0% | Maximum quality, plenty of VRAM |
| Q8 | 2 | ~0.1% | Default for medium contexts |
| Q6 | 1.5 | ~0.3% | Good balance |
| Q4 | 1 | ~1.0% | Long context on tight VRAM |
For Llama 3.1 70B at 16K context: FP16 cache = ~5 GB, Q8 = ~2.5 GB, Q4 = ~1.25 GB. The savings from Q4 cache often allow you to fit a higher-bpw model on the same GPU.
Worked example on an RTX 4090 (24 GB) with a 70B EXL2 quant, using the sizes above:
- 2.4bpw + Q4 cache at 8K context → fits with roughly 1 GB free
- 2.4bpw + Q4 cache at 16K context → just barely fits (a desktop session on the same GPU can push it to OOM)
- 2.4bpw + Q8 cache at 16K context → OOM
- 3.0bpw and above → OOM at load
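The cache figures above can be reproduced from the model's attention geometry — a rough calculator assuming Llama 3.1 70B's published shape (80 layers, 8 GQA KV heads, head dim 128):

# Bytes per cached element for each cache_mode (applied to both K and V)
BYTES_PER_ELEM = {"FP16": 2.0, "Q8": 1.0, "Q6": 0.75, "Q4": 0.5}

def kv_cache_gb(ctx_len, n_layers=80, n_kv_heads=8, head_dim=128, mode="Q4"):
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEM[mode]  # K + V
    return per_token * ctx_len / 1024**3

for mode in BYTES_PER_ELEM:
    print(f"{mode}: {kv_cache_gb(16384, mode=mode):.2f} GB at 16K context")
# FP16 ≈ 5.0 GB, Q8 ≈ 2.5 GB, Q6 ≈ 1.9 GB, Q4 ≈ 1.25 GB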
Long Context (32K-131K) {#long-context}
model:
max_seq_len: 131072
cache_mode: Q4
cache_size: 131072
chunk_size: 4096 # prompt processing chunk
For RoPE scaling on models that need it (e.g., scaling Llama 3 8K to 32K):
model:
rope_scale: 4.0 # linear interpolation
rope_alpha: 1.0 # NTK-aware
Llama 3.1 (128K native) and Qwen 2.5 (32K native, 131K with YaRN) handle long context in the model itself — no additional flags. For the older Llama 3 (8K native), extending past 8K requires setting rope_scale and rope_alpha based on YaRN or NTK-aware formulas.
Multi-GPU Layer Split {#multi-gpu}
ExLlamaV2 splits layers across GPUs (pipeline-parallel-like, not tensor-parallel).
model:
gpu_split_auto: true
autosplit_reserve: [96, 96] # reserve 96 MB on each GPU
Or explicit:
model:
gpu_split_auto: false
gpu_split: [22, 22] # 22 GB on GPU0, 22 GB on GPU1
For 2x RTX 3090 with NVLink, expect 1.4-1.6x of single-3090 throughput on 70B 4.5bpw. Without NVLink (RTX 4090 / 5090), ~1.3-1.5x. Inferior to vLLM tensor parallel, but simpler.
Sampling & Prompt Templates {#sampling}
TabbyAPI exposes the full sampler stack via the OpenAI-compatible API plus extension fields:
{
"model": "Llama-3.1-70B",
"messages": [{"role": "user", "content": "Hello"}],
"temperature": 0.7,
"top_p": 0.9,
"min_p": 0.05,
"top_k": 0,
"repetition_penalty": 1.05,
"frequency_penalty": 0.0,
"presence_penalty": 0.0,
"smoothing_factor": 0.0,
"skew": 0.0,
"min_tokens": 0,
"max_tokens": 1024,
"stop": ["<|eot_id|>"],
"stream": true
}
DRY and XTC are not yet in TabbyAPI as of mid-2026 — for those, use oobabooga's text-generation-webui or KoboldCpp. See LLM Sampling Parameters for what each parameter does.
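From the openai Python client, the extension fields (min_p, smoothing_factor, and friends) travel through extra_body. A streaming sketch against a local TabbyAPI, same assumptions as earlier (port 5000, generated API key):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")

stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Explain KV cache quantization in two sentences."}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
    stream=True,
    extra_body={"min_p": 0.05, "repetition_penalty": 1.05},  # TabbyAPI extension fields
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)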
Prompt templates
TabbyAPI auto-detects prompt templates from tokenizer_config.json (Jinja). Override with prompt_template in the model section if needed. Supplied templates: alpaca, chatml, llama3, mistral, vicuna, phi3, etc., in templates/.
LoRA Adapters {#lora}
loras:
- name: my-style
scaling: 1.0
Place the LoRA folder under loras/my-style/. ExLlamaV2 supports peft-format LoRAs converted with the exllamav2.lora utilities.
Per-request LoRA via the API:
{
"model": "...",
"messages": [...],
"lora_request": {"name": "my-style", "scaling": 1.0}
}
Multiple LoRAs can be loaded simultaneously and selected per request.
Speculative Decoding {#speculative}
draft_model:
draft_model_dir: models/Llama-3.2-1B-Instruct-exl2-4.5bpw
draft_rope_scale: 1.0
draft_cache_mode: Q4
Pair Llama 3.1 70B target with Llama 3.2 1B draft (same vocab). Expected speedup: 1.5-2.0x at batch size 1 with high acceptance rate. See CUDA Optimization for theory.
Tool Calling / Function Calling {#tool-calling}
TabbyAPI supports OpenAI-style tools for any chat-tuned model with a tool template (Llama 3.1+, Qwen 2.5+, Mistral Large 2). Pass tools and tool_choice as in the OpenAI spec; TabbyAPI handles parsing the model output and returning structured tool_calls in the response.
For consistent JSON output without grammars, use temperature ≤ 0.3 and pair with a guidance/outlines wrapper at the application layer.
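A minimal tool-calling round trip, OpenAI-style. The schema format is the standard spec; whether the model actually emits a call depends on the loaded model's tool template.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
    tool_choice="auto",
)
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)  # e.g. get_weather {"city": "Oslo"}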
Performance Benchmarks {#benchmarks}
Single user, batch size 1, 128 output tokens, 4K context, RTX 4090.
Llama 3.1 8B Q4-class
| Framework | tok/s |
|---|---|
| Ollama (Q4_K_M) | 127 |
| llama.cpp (Q4_K_M, FA) | 130 |
| vLLM (AWQ-INT4) | 155 |
| TabbyAPI / ExLlamaV2 (EXL2 4.0bpw) | 165 |
| TensorRT-LLM (AWQ) | 178 |
Llama 3.1 70B Q4-class
| Framework | tok/s | VRAM (GB) |
|---|---|---|
| Ollama (Q4_K_M) | 8 (offload) | 24+15 |
| llama.cpp (Q4_K_M, FA) | 9 (offload) | 24+15 |
| vLLM (AWQ-INT4) | OOM | 24 |
| TabbyAPI / ExLlamaV2 (EXL2 2.4bpw + Q4 cache) | 22 | 24 |
| 2x RTX 4090 vLLM (AWQ TP=2) | 38 | 48 |
| 2x RTX 4090 TabbyAPI (EXL2 5.0bpw) | 31 | 48 |
For single-GPU 70B inference, TabbyAPI is the only practical option at usable speeds.
Tuning Recipes by GPU {#tuning}
RTX 3090 (24 GB)
model:
model_dir: models
model_name: Llama-3.1-70B-EXL2-2.4bpw
max_seq_len: 8192
cache_mode: Q4
cache_size: 8192
chunk_size: 2048
fasttensors: true
RTX 4090 (24 GB)
Same as the 3090, but max_seq_len: 16384 with Q4 cache just fits if nothing else is using the GPU.
RTX 5090 (32 GB)
model:
model_dir: models
model_name: Llama-3.1-70B-EXL2-3.0bpw
max_seq_len: 16384
cache_mode: Q6
cache_size: 16384
A6000 / RTX 6000 Ada (48 GB)
model:
model_dir: models
model_name: Llama-3.1-70B-EXL2-4.5bpw
max_seq_len: 32768
cache_mode: Q8
2x RTX 3090 NVLink
model:
gpu_split_auto: false
gpu_split: [22, 22]
model_dir: models
model_name: Llama-3.1-70B-EXL2-4.5bpw
max_seq_len: 16384
cache_mode: Q6
Common Mistakes & Fixes {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| OOM at load | bpw + cache too large | Lower bpw or cache_mode |
| OOM mid-generation | Long context grew KV | Lower max_seq_len or use Q4 cache |
| Slow first token | Prompt prefill | Set chunk_size higher (4096) |
| Garbled output | Wrong prompt template | Set explicit prompt_template: llama3 |
| Multi-GPU slower than single | NVLink not used | Verify NVLink with nvidia-smi nvlink -s |
| flash-attn install fails | Wrong CUDA / wheel | Use ./start.sh which pins versions |
| Model loads but errors on first request | Tokenizer mismatch | Re-download repo, ensure tokenizer files present |
| Repetition / loops | Sampling too narrow | Increase min_p to 0.05, set repetition_penalty to 1.05 |
FAQ {#faq}
See answers to common ExLlamaV2 / TabbyAPI questions below.
Sources: ExLlamaV2 GitHub | TabbyAPI GitHub | bartowski's EXL2 quants on HF | turboderp's quants | Internal benchmarks RTX 3090, 4090, 5090.