
ExLlamaV2 + TabbyAPI Guide (2026): Best INT4 Inference on a Single GPU

May 1, 2026
28 min read
LocalAimaster Research Team


ExLlamaV2 is the unsung hero of consumer-GPU local LLM inference. On a single RTX 3090 / 4090 / 5090, no other framework matches its speed for INT4 quantized models — not vLLM, not llama.cpp, not Ollama. The catch: it is built for single-GPU, single-user workloads, is NVIDIA-only, and is barely documented for newcomers.

TabbyAPI puts an OpenAI-compatible HTTP server on top, making ExLlamaV2 production-usable. Together they are the right answer when you have one good NVIDIA GPU, run one user at a time, and want maximum tokens per second.

This guide covers everything: how EXL2 quantization works and why it beats AWQ / GGUF on quality-per-bit, installing TabbyAPI on Linux / Windows / Docker, picking the right bits-per-weight for your VRAM, long-context tuning with cache quantization, sampling presets, multi-GPU layer splits, and benchmarks against vLLM, llama.cpp, and TensorRT-LLM on the same hardware.

Table of Contents

  1. What ExLlamaV2 and TabbyAPI Are
  2. Why EXL2 Beats Other 4-Bit Formats on Quality
  3. Hardware & Software Requirements
  4. Installation: TabbyAPI
  5. Picking the Right EXL2 bpw for Your VRAM
  6. Downloading Pre-Quantized Models
  7. Quantizing Your Own Model to EXL2
  8. TabbyAPI Configuration
  9. Cache Quantization (Q4 / Q6 / Q8 / FP16)
  10. Long Context (32K-131K)
  11. Multi-GPU Layer Split
  12. Sampling & Prompt Templates
  13. LoRA Adapters
  14. Speculative Decoding
  15. Tool Calling / Function Calling
  16. Performance Benchmarks
  17. Tuning Recipes by GPU
  18. Common Mistakes & Fixes
  19. FAQ


What ExLlamaV2 and TabbyAPI Are {#what-it-is}

ExLlamaV2 (turboderp's repo) is a CUDA-first inference library for INT4-quantized transformer models. It pairs hand-tuned dequant-fused matmul kernels with an aggressive batch-size-1 scheduler. The format it loads, EXL2, is its own quantization scheme — measurement-based, mixed-bit-width, optimized for inference on Ampere+ NVIDIA cards.

TabbyAPI (theroyallab/tabbyAPI) is the OpenAI-compatible server wrapping ExLlamaV2. It exposes /v1/chat/completions, /v1/completions, streaming, sampling presets, prompt templates, basic auth, and a YAML config.

Together, they are the right local inference stack when:

  • You have one NVIDIA GPU (Ampere or newer).
  • You serve one or two concurrent users.
  • You want maximum tok/s.

For multi-user concurrent serving, see vLLM; for the lowest single-stream latency on H100-class hardware, TensorRT-LLM; for AMD / Apple, llama.cpp / Ollama.


Why EXL2 Beats Other 4-Bit Formats on Quality {#exl2-quality}

Most 4-bit formats (GPTQ, AWQ, GGUF Q4_K_M) use a uniform bit width — every weight matrix gets the same number of bits. EXL2 instead allocates bits per layer based on measurement: it runs a calibration pass, measures how much each layer's weights affect output perplexity, and gives more bits to sensitive layers (attention QKV, output projection) and fewer to less-sensitive ones (later FFN layers).

Result: at the same average bits-per-weight, EXL2 produces lower perplexity than uniform-width quants.

Llama 3.1 70B perplexity on WikiText (lower is better):

| Format | Avg bits | Size | PPL |
|---|---|---|---|
| FP16 | 16.00 | 140 GB | 4.96 |
| EXL2 6.0bpw | 6.00 | 53 GB | 4.97 |
| EXL2 5.0bpw | 5.00 | 44 GB | 4.99 |
| AWQ-INT4 g128 | 4.25 | 36 GB | 5.01 |
| GGUF Q5_K_M | 5.66 | 50 GB | 5.00 |
| EXL2 4.5bpw | 4.50 | 40 GB | 5.00 |
| EXL2 4.0bpw | 4.00 | 35 GB | 5.05 |
| GGUF Q4_K_M | 4.83 | 42 GB | 5.06 |
| GGUF IQ4_XS | 4.25 | 38 GB | 5.04 |
| GPTQ-128g | 4.25 | 36 GB | 5.10 |

EXL2 4.0bpw is smaller than AWQ-INT4 g128 with comparable perplexity — both fit a 70B model in a 24 GB GPU with 4-6 GB left for KV cache. That 4-6 GB of KV makes a 70B model with 8K-16K context fully on-GPU practical on a single RTX 4090.


Hardware & Software Requirements {#requirements}

| Component | Minimum | Recommended |
|---|---|---|
| GPU | Compute capability 8.0+ (Ampere) | RTX 3090 / 4090 / 5090 / A6000 / L40S |
| VRAM (8B class) | 8 GB | 12 GB+ |
| VRAM (32B class) | 16 GB | 20 GB+ |
| VRAM (70B class) | 24 GB | 24 GB exactly is fine |
| Driver | 535+ | 555+ |
| CUDA | 12.1+ (build only; runtime via PyTorch) | 12.4+ |
| Python | 3.10 | 3.11-3.12 |
| OS | Linux, Windows | Ubuntu 22.04 LTS |
| RAM | 16 GB | 32 GB+ |
| Disk | 100 GB | NVMe |

Turing (RTX 20-series, GTX 16) is not supported — kernels require Ampere SM 8.0 features.
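
If you are unsure what your card reports, a quick check with PyTorch (already pulled in by the TabbyAPI install) confirms whether you clear the SM 8.0 floor. A minimal sketch:

# Check that the GPU is Ampere (SM 8.0) or newer before installing.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device visible - check your driver install.")

major, minor = torch.cuda.get_device_capability(0)
name = torch.cuda.get_device_name(0)
print(f"{name}: compute capability {major}.{minor}")
if (major, minor) < (8, 0):
    print("Turing or older - ExLlamaV2 kernels will not run on this GPU.")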



Installation: TabbyAPI {#installation}

Linux / WSL2

git clone https://github.com/theroyallab/tabbyAPI.git
cd tabbyAPI
python3.11 -m venv venv
source venv/bin/activate
pip install --upgrade pip

# Install with CUDA 12.4 wheels
./start.sh    # auto-detects CUDA, picks right wheels

The start.sh script handles installing exllamav2 + flash-attn + dependencies with the right CUDA pins.

Windows

git clone https://github.com/theroyallab/tabbyAPI.git
cd tabbyAPI
.\start.bat

The start.bat script does the equivalent venv + install.

Docker

docker run -d --name tabbyapi \
    --gpus all \
    -p 5000:5000 \
    -v $(pwd)/models:/app/models \
    -v $(pwd)/config.yml:/app/config.yml \
    ghcr.io/theroyallab/tabbyapi:latest

Manual install (latest packages)

pip install exllamav2
pip install flash-attn --no-build-isolation
pip install -r requirements.txt
python main.py --config config.yml

Default port 5000. Browse http://localhost:5000/docs for the FastAPI Swagger UI.
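
Once the server is up, a quick smoke test against the OpenAI-compatible endpoints confirms the install. A minimal sketch with requests, assuming disable_auth: true for local testing — otherwise pass the key TabbyAPI generates (api_tokens.yml by default):

# Smoke-test a freshly started TabbyAPI instance.
import requests

BASE = "http://localhost:5000"
# With auth enabled, send the generated key instead, e.g.
# headers = {"Authorization": "Bearer <your-api-key>"}
headers = {}

resp = requests.get(f"{BASE}/v1/models", headers=headers, timeout=10)
resp.raise_for_status()
print(resp.json())  # should list the model loaded from config.yml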


Picking the Right EXL2 bpw for Your VRAM {#picking-bpw}

Approximate VRAM budget (GB) ≈ parameter count (billions) × bpw / 8 + KV cache + overhead (~1-2 GB).

| GPU VRAM | Recommended model + bpw |
|---|---|
| 8 GB | Llama 3.1 8B EXL2 4.0bpw, 4K ctx, Q4 cache |
| 12 GB | Llama 3.1 8B EXL2 6.0bpw, 16K ctx |
| 16 GB | Qwen 2.5 14B EXL2 5.0bpw, 16K ctx |
| 24 GB | Qwen 2.5 32B EXL2 5.0bpw / Llama 3.1 70B EXL2 2.4bpw / Llama 3.1 70B EXL2 4.0bpw with Q4 cache |
| 32 GB (5090) | Llama 3.1 70B EXL2 4.5bpw, 16K ctx, FP16 cache |
| 48 GB (A6000 / 2x 24 GB) | Llama 3.1 70B EXL2 6.0bpw, 32K ctx |
| 80 GB (H100 / A100) | Llama 3.1 70B EXL2 8.0bpw or 405B EXL2 2.4bpw |

For 70B on exactly 24 GB: 4.0bpw + Q4 cache works at 8K context. 4.5bpw works at ~4K context. 5.0bpw and above will OOM.
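
The budget formula above is easy to script. A rough sketch — the ~1-2 GB overhead and the KV figure are ballpark assumptions, not exact numbers:

# Rough single-GPU VRAM estimate for an EXL2 model (all figures approximate).
def estimate_vram_gb(params_b: float, bpw: float, kv_cache_gb: float,
                     overhead_gb: float = 1.5) -> float:
    """params_b: parameter count in billions; bpw: average bits per weight."""
    weights_gb = params_b * bpw / 8          # e.g. 8 * 6.0 / 8 = 6 GB of weights
    return weights_gb + kv_cache_gb + overhead_gb

# Example: Llama 3.1 8B at 6.0 bpw with ~1 GB of KV cache
print(f"{estimate_vram_gb(8, 6.0, 1.0):.1f} GB")   # ~8.5 GB -> fits a 12 GB card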


Downloading Pre-Quantized Models {#downloading}

The community publishes EXL2 quants on Hugging Face. Top providers:

  • turboderp — the author; smallest selection but highest quality.
  • LoneStriker — extensive coverage of popular models.
  • bartowski — broad coverage, multiple bpw per model.

# Use huggingface-cli
huggingface-cli download bartowski/Meta-Llama-3.1-70B-Instruct-exl2 \
    --revision 4_0 \
    --local-dir ./models/Llama-3.1-70B-4.0bpw

EXL2 repos use branch revisions for different bpw — --revision 4_0 selects the 4.0bpw branch. Always check the README for available branches.
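
The same download can be scripted with huggingface_hub if you prefer Python over the CLI (repo name and branch taken from the example above):

# Download one bpw branch of an EXL2 repo programmatically.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="bartowski/Meta-Llama-3.1-70B-Instruct-exl2",
    revision="4_0",                              # bpw branch, not a tag
    local_dir="./models/Llama-3.1-70B-4.0bpw",
)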


Quantizing Your Own Model to EXL2 {#quantizing}

git clone https://github.com/turboderp-org/exllamav2
cd exllamav2

# Quantize Llama 3.1 8B to 4.65bpw average
python convert.py \
    -i ./Llama-3.1-8B-Instruct \
    -o ./Llama-3.1-8B-4.65bpw \
    -nr \
    -b 4.65 \
    -hb 6 \
    -c ./calibration.parquet

Flags:

  • -b — target average bpw (e.g., 4.65, 5.0, 6.5)
  • -hb — head bits for the output projection layer (keep at 6+ for quality)
  • -c — calibration dataset (wikitext-2.parquet is the default; provide domain-specific data if your use case is narrow)
  • -nr — no resume: ignore any existing measurement/job files and start fresh

Time: Llama 3.1 8B = 30-60 min on RTX 4090; 70B = 2-4 hours; 405B = 12-24 hours.

For a domain-specific calibration set, format your data as a single text column in a Parquet file, roughly 1,024 rows of ~2K tokens each. Custom calibration typically improves perplexity on in-domain data by 1-3%.
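
A minimal sketch of building that Parquet file with pandas, assuming your domain text is already split into chunks (the loader helper here is hypothetical — replace it with your own):

# Write a single-column calibration Parquet for convert.py's -c flag.
import pandas as pd

# chunks: list[str] of ~1024 domain-specific passages, ~2K tokens each
chunks = load_my_domain_chunks()   # hypothetical helper - substitute your own loading code

pd.DataFrame({"text": chunks}).to_parquet("calibration.parquet", index=False)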


TabbyAPI Configuration {#config}

config.yml:

network:
  host: 0.0.0.0
  port: 5000
  disable_auth: false      # set true for local dev only

logging:
  log_prompt: false
  log_generation_params: false
  log_requests: false

model:
  model_dir: models/Meta-Llama-3.1-70B-Instruct-exl2-4.0bpw
  use_dummy_models: false
  inline_model_loading: true
  use_as_default: ['max_seq_len', 'cache_mode']

  max_seq_len: 16384
  override_base_seq_len:
  cache_mode: Q4               # Q4, Q6, Q8, FP16
  cache_size: 16384
  chunk_size: 2048
  prompt_template:             # auto-detected from tokenizer_config; override here

  # Multi-GPU
  gpu_split_auto: true
  autosplit_reserve: [96]      # MB reserved on first GPU
  gpu_split:                   # explicit override e.g. [22, 22]

  # Optimizations
  fasttensors: true

  # Generation defaults
  rope_scale: 1.0
  rope_alpha: 1.0

draft_model:
  draft_model_dir:             # speculative decoding draft
  draft_rope_scale: 1.0

embeddings:
  embedding_model_dir:
  embeddings_device: cpu

developer:
  unsafe_launch: false
  disable_request_streaming: false
  cuda_malloc_backend: true
  uvloop: true
  realtime_process_priority: false

Run:

python main.py --config config.yml

Endpoints (OpenAI-compatible): /v1/chat/completions, /v1/completions. Plus /v1/model/load, /v1/model/unload for hot-swapping models.
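
Because the endpoints are OpenAI-compatible, the standard openai Python client works as-is. A minimal sketch, assuming the model name matches what config.yml loaded and the key comes from api_tokens.yml (any string works when auth is disabled):

# Point the standard OpenAI client at TabbyAPI.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",
    api_key="YOUR_TABBY_API_KEY",
)

resp = client.chat.completions.create(
    model="Meta-Llama-3.1-70B-Instruct-exl2-4.0bpw",  # name of the loaded model
    messages=[{"role": "user", "content": "Summarize what EXL2 quantization does."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)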


Cache Quantization (Q4 / Q6 / Q8 / FP16) {#cache-quantization}

| Cache mode | Bytes per K+V element pair | Perplexity loss | Use case |
|---|---|---|---|
| FP16 | 4 | 0% | Maximum quality, plenty of VRAM |
| Q8 | 2 | ~0.1% | Default for medium contexts |
| Q6 | 1.5 | ~0.3% | Good balance |
| Q4 | 1 | ~1.0% | Long context on tight VRAM |

For Llama 3.1 70B at 16K context: FP16 cache = ~5 GB, Q8 = ~2.5 GB, Q4 = ~1.25 GB. The savings from Q4 cache often allow you to fit a higher-bpw model on the same GPU.

Real example on RTX 4090 with 70B at 16K:

  • 4.0bpw + Q4 cache → fits with 1 GB free
  • 4.5bpw + Q4 cache → just barely fits
  • 4.0bpw + Q6 cache → fits with ~500 MB free
  • 4.0bpw + FP16 cache → OOM
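
The ~5 GB / 2.5 GB / 1.25 GB figures follow directly from the per-element sizes in the table. A back-of-the-envelope check, assuming Llama 3.1 70B's published architecture (80 layers, 8 KV heads, head dim 128):

# Approximate KV cache size: layers * kv_heads * head_dim * 2 (K and V) * bytes/element * tokens
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_element: float) -> float:
    per_token = layers * kv_heads * head_dim * 2 * bytes_per_element
    return per_token * context / 1024**3

# Llama 3.1 70B at 16K context
print(f"FP16: {kv_cache_gb(80, 8, 128, 16384, 2.0):.2f} GB")   # ~5 GB
print(f"Q8:   {kv_cache_gb(80, 8, 128, 16384, 1.0):.2f} GB")   # ~2.5 GB
print(f"Q4:   {kv_cache_gb(80, 8, 128, 16384, 0.5):.2f} GB")   # ~1.25 GB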

Long Context (32K-131K) {#long-context}

model:
  max_seq_len: 131072
  cache_mode: Q4
  cache_size: 131072
  chunk_size: 4096          # prompt processing chunk

For RoPE scaling on models that need it (e.g., scaling Llama 3 8K to 32K):

model:
  rope_scale: 4.0           # linear interpolation
  rope_alpha: 1.0           # NTK-aware

Llama 3.1 (131K native) and Qwen 2.5 (32K native, 131K with YaRN) handle long context in the model itself — no additional flags needed. For the older Llama 3 (8K native), extend past 8K by setting rope_scale and rope_alpha using linear, NTK-aware, or YaRN formulas.


Multi-GPU Layer Split {#multi-gpu}

ExLlamaV2 splits layers across GPUs (pipeline-parallel-like, not tensor-parallel).

model:
  gpu_split_auto: true
  autosplit_reserve: [96, 96]   # reserve 96 MB on each GPU

Or explicit:

model:
  gpu_split_auto: false
  gpu_split: [22, 22]            # 22 GB on GPU0, 22 GB on GPU1

For 2x RTX 3090 with NVLink, expect 1.4-1.6x of single-3090 throughput on 70B 4.5bpw. Without NVLink (RTX 4090 / 5090), ~1.3-1.5x. Inferior to vLLM tensor parallel, but simpler.


Sampling & Prompt Templates {#sampling}

TabbyAPI exposes the full sampler stack via the OpenAI-compatible API plus extension fields:

{
  "model": "Llama-3.1-70B",
  "messages": [{"role": "user", "content": "Hello"}],
  "temperature": 0.7,
  "top_p": 0.9,
  "min_p": 0.05,
  "top_k": 0,
  "repetition_penalty": 1.05,
  "frequency_penalty": 0.0,
  "presence_penalty": 0.0,
  "smoothing_factor": 0.0,
  "skew": 0.0,
  "min_tokens": 0,
  "max_tokens": 1024,
  "stop": ["<|eot_id|>"],
  "stream": true
}

DRY and XTC are not yet in TabbyAPI as of mid-2026 — for those, use oobabooga's text-generation-webui or KoboldCpp. See LLM Sampling Parameters for what each parameter does.
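
With the openai Python client, the extension fields that are not part of the official spec (min_p, smoothing_factor, skew, and friends) can be passed through extra_body. A short sketch:

# Stream a completion while passing TabbyAPI's extra sampler fields via extra_body.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")

stream = client.chat.completions.create(
    model="Llama-3.1-70B",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
    extra_body={"min_p": 0.05, "repetition_penalty": 1.05},  # extension fields
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)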

Prompt templates

TabbyAPI auto-detects prompt templates from tokenizer_config.json (Jinja). Override with prompt_template in the model section if needed. Supplied templates: alpaca, chatml, llama3, mistral, vicuna, phi3, etc., in templates/.


LoRA Adapters {#lora}

loras:
  - name: my-style
    scaling: 1.0

Place the LoRA folder under loras/my-style/. ExLlamaV2 supports peft-format LoRAs converted with the exllamav2.lora utilities.

Per-request LoRA via the API:

{
  "model": "...",
  "messages": [...],
  "lora_request": {"name": "my-style", "scaling": 1.0}
}

Multiple LoRAs can be loaded simultaneously and selected per request.


Speculative Decoding {#speculative}

draft_model:
  draft_model_dir: models/Llama-3.2-1B-Instruct-exl2-4.5bpw
  draft_rope_scale: 1.0
  draft_cache_mode: Q4

Pair Llama 3.1 70B target with Llama 3.2 1B draft (same vocab). Expected speedup: 1.5-2.0x at batch size 1 with high acceptance rate. See CUDA Optimization for theory.


Tool Calling / Function Calling {#tool-calling}

TabbyAPI supports OpenAI-style tools for any chat-tuned model with a tool template (Llama 3.1+, Qwen 2.5+, Mistral Large 2). Pass tools and tool_choice as in the OpenAI spec; TabbyAPI handles parsing the model output and returning structured tool_calls in the response.

For consistent JSON output without grammars, use temperature ≤ 0.3 and pair with a guidance/outlines wrapper at the application layer.
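
A minimal tool-calling sketch against TabbyAPI with the openai client, assuming a Llama 3.1 class model is loaded (the weather function is illustrative only, not a real API):

# OpenAI-style tool calling against TabbyAPI.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                       # illustrative tool definition
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Llama-3.1-70B",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
    temperature=0.3,
)

for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))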


Performance Benchmarks {#benchmarks}

Single user, batch size 1, 128 output tokens, 4K context, RTX 4090.

Llama 3.1 8B Q4-class

| Framework | tok/s |
|---|---|
| Ollama (Q4_K_M) | 127 |
| llama.cpp (Q4_K_M, FA) | 130 |
| vLLM (AWQ-INT4) | 155 |
| TabbyAPI / ExLlamaV2 (EXL2 4.0bpw) | 165 |
| TensorRT-LLM (AWQ) | 178 |

Llama 3.1 70B Q4-class

| Framework | tok/s | Memory |
|---|---|---|
| Ollama (Q4_K_M) | 8 (CPU offload) | 24 GB VRAM + 15 GB RAM |
| llama.cpp (Q4_K_M, FA) | 9 (CPU offload) | 24 GB VRAM + 15 GB RAM |
| vLLM (AWQ-INT4) | OOM | 24 GB |
| TabbyAPI / ExLlamaV2 (EXL2 4.0bpw + Q4 cache) | 22 | 24 GB |
| 2x RTX 4090 vLLM (AWQ, TP=2) | 38 | 48 GB |
| 2x RTX 4090 TabbyAPI (EXL2 5.0bpw) | 31 | 48 GB |

For single-GPU 70B inference, TabbyAPI is the only practical option at usable speeds.
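
If you want to reproduce the single-stream numbers on your own hardware, the token counts in the response's usage block make a rough tok/s measurement easy. A sketch, not a rigorous benchmark — no warmup or averaging, and prompt processing is included in the timing:

# Rough single-stream decode-speed check against a running TabbyAPI server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="YOUR_TABBY_API_KEY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="Llama-3.1-70B",
    messages=[{"role": "user", "content": "Write a 200-word summary of the history of GPUs."}],
    max_tokens=128,
    temperature=0.7,
)
elapsed = time.perf_counter() - start

out_tokens = resp.usage.completion_tokens
print(f"{out_tokens} tokens in {elapsed:.2f}s -> {out_tokens / elapsed:.1f} tok/s")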


Tuning Recipes by GPU {#tuning}

RTX 3090 (24 GB)

model:
  model_dir: models/Llama-3.1-70B-EXL2-4.0bpw
  max_seq_len: 8192
  cache_mode: Q4
  cache_size: 8192
  chunk_size: 2048
  fasttensors: true

RTX 4090 (24 GB)

Same as 3090 but max_seq_len: 16384 works comfortably with Q4 cache.

RTX 5090 (32 GB)

model:
  model_dir: models/Llama-3.1-70B-EXL2-4.5bpw
  max_seq_len: 16384
  cache_mode: Q6
  cache_size: 16384

A6000 / RTX 6000 Ada (48 GB)

model:
  model_dir: models/Llama-3.1-70B-EXL2-6.0bpw
  max_seq_len: 32768
  cache_mode: Q8

2x RTX 3090 / 4090 (2x 24 GB)

model:
  model_dir: models/Llama-3.1-70B-EXL2-5.0bpw
  gpu_split_auto: false
  gpu_split: [22, 22]
  max_seq_len: 16384
  cache_mode: Q6

Common Mistakes & Fixes {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| OOM at load | bpw + cache too large | Lower bpw or cache_mode |
| OOM mid-generation | Long context grew the KV cache | Lower max_seq_len or use Q4 cache |
| Slow first token | Prompt prefill | Raise chunk_size (e.g., 4096) |
| Garbled output | Wrong prompt template | Set an explicit prompt_template: llama3 |
| Multi-GPU slower than single | NVLink not used | Verify NVLink with nvidia-smi nvlink -s |
| flash-attn install fails | Wrong CUDA / wheel | Use ./start.sh, which pins versions |
| Model loads but errors on first request | Tokenizer mismatch | Re-download the repo; ensure tokenizer files are present |
| Repetition / loops | Sampling too narrow | Raise min_p to 0.05, set repetition_penalty 1.05 |

FAQ {#faq}

See answers to common ExLlamaV2 / TabbyAPI questions below.


Sources: ExLlamaV2 GitHub | TabbyAPI GitHub | bartowski's EXL2 quants on HF | turboderp's quants | Internal benchmarks RTX 3090, 4090, 5090.
