text-generation-webui (oobabooga) Complete Guide (2026): Setup, Loaders, Extensions
text-generation-webui — almost everyone calls it "oobabooga" after its maintainer — is the Swiss Army knife of local LLM UIs. It is the only major UI that supports multiple inference loaders behind one interface: Transformers, llama.cpp, ExLlamaV2, TensorRT-LLM, and HQQ. It ships built-in QLoRA fine-tuning, RAG via superboogav2, multimodal (vision) support, an OpenAI-compatible API, and a deep extension system. If you want to experiment broadly across model formats and capabilities without reinstalling tooling, this is the right starting point.
This guide covers everything: the one-click installer, picking the right loader for your model format, the Chat / Default / Notebook modes, the most useful extensions, QLoRA fine-tuning, OpenAI API mode, sampling presets, character cards, SillyTavern integration, and tuning recipes for common GPUs.
Table of Contents
- What text-generation-webui Is
- Loader Selection: When to Use Which
- Hardware Requirements
- Installation: One-Click and Manual
- The Three UI Modes (Chat, Default, Notebook)
- Loading Models
- Sampler Presets
- Character Cards & Personas
- The OpenAI-Compatible API Extension
- Built-In RAG: superboogav2
- Multimodal: LLaVA, Llama 3.2 Vision
- Whisper STT and Silero TTS Extensions
- QLoRA Fine-Tuning in the Training Tab
- SillyTavern Integration
- Useful Extensions Beyond the Defaults
- Long Context Tuning
- Tuning Recipes by GPU
- Troubleshooting
What text-generation-webui Is {#what-it-is}
A Gradio-based Python application that:
- Loads LLMs via 5+ swappable loaders (Transformers / llama.cpp / ExLlamaV2 / TensorRT-LLM / HQQ).
- Provides three UI modes: Chat, Default (single-turn), Notebook (free-form).
- Exposes every modern sampler including DRY, XTC, and Mirostat.
- Supports character cards, personas, and world info.
- Ships QLoRA fine-tuning in a Training tab.
- Has 50+ official + community extensions.
- Runs on Windows / Linux / macOS / Docker with a one-click installer.
Project: github.com/oobabooga/text-generation-webui. License: AGPL-3.0.
Loader Selection: When to Use Which {#loaders}
| Loader | Format | Speed | When |
|---|---|---|---|
| Transformers | HF FP16/BF16, AWQ, GPTQ | Slow | Quality testing; HF ecosystem |
| llama.cpp | GGUF | Fast | CPU / Mac / mixed GPU offload |
| ExLlamaV2 | EXL2 | Fastest single-GPU NVIDIA | RTX 30/40/50 single-user |
| TensorRT-LLM | TRT-LLM engines | Lowest latency | Production NVIDIA |
| HQQ | HQQ format | Fast | New experimental formats |
For consumer NVIDIA + creative writing: ExLlamaV2 + EXL2 quants. For broad model coverage / Mac / CPU: llama.cpp + GGUF. For experimentation across formats: keep the option to swap.
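You can also pin the loader at launch instead of picking it in the UI. A sketch (the --model and --loader flags exist in current releases, but verify against python server.py --help on your install):
# Force the llama.cpp loader for a GGUF model at startup
python server.py --model llama-3.1-8b-instruct-Q5_K_M.gguf --loader llama.cpp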
Hardware Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| GPU | None (CPU works for small models) | 12 GB VRAM+ for 8B models |
| RAM | 16 GB | 32 GB |
| Disk | 20 GB free | NVMe; models 4-50 GB each |
| Python | 3.10 (installer manages) | 3.11 |
| OS | Windows 10/11, Linux, macOS 12+ | Ubuntu 22.04 LTS |
For NVIDIA: CUDA 11.8+ driver. For AMD: ROCm 6.x or Vulkan. For Apple: Metal (auto). For Intel Arc: Vulkan or oneAPI.
Installation: One-Click and Manual {#installation}
One-click (recommended)
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
# Windows: double-click start_windows.bat
# Linux:
./start_linux.sh
# macOS:
./start_macos.sh
The installer asks for your GPU vendor on first run and configures everything. The UI launches at http://localhost:7860.
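With the one-click installer, persistent launch flags belong in CMD_FLAGS.txt at the repo root; the start scripts read it on every launch. The flags here are illustrative:
# CMD_FLAGS.txt
--api --listen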
Manual install
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
python3.11 -m venv venv
source venv/bin/activate
# Pick one PyTorch install
pip install torch --index-url https://download.pytorch.org/whl/cu124 # NVIDIA
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2 # AMD
pip install torch # Mac
# Install the rest
pip install -r requirements.txt
# Run
python server.py
Docker
docker run --gpus all -p 7860:7860 -p 5000:5000 \
-v $(pwd)/models:/app/models \
-v $(pwd)/loras:/app/loras \
-v $(pwd)/characters:/app/characters \
atinoda/text-generation-webui:latest
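If you prefer Compose, roughly the same setup as a docker-compose.yml sketch (assuming the same atinoda image and an NVIDIA GPU; adjust ports and volumes to taste):
services:
  text-generation-webui:
    image: atinoda/text-generation-webui:latest
    ports:
      - "7860:7860"   # web UI
      - "5000:5000"   # API
    volumes:
      - ./models:/app/models
      - ./loras:/app/loras
      - ./characters:/app/characters
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]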
The Three UI Modes (Chat, Default, Notebook) {#ui-modes}
- Chat — multi-turn chat with personas / characters. Most common.
- Default — single-turn instruction. Best for quick tests and benchmarking.
- Notebook — free-form text editor; you author both sides. Best for fiction.
Switch via the top tab. Each mode keeps its own conversation history.
Loading Models {#loading-models}
- Place model files under models/:
  - GGUF: a single .gguf file in models/<model-name>/.
  - EXL2: the entire repo folder under models/<model-name>/.
  - HuggingFace Transformers: full HF folder (config, tokenizer, weights) under models/<model-name>/.
- Open the Model tab.
- Pick the model from the dropdown, pick the loader (auto-detected for most formats).
- Click Load. The status bar shows the loader and VRAM use.
Common loader settings:
- n-gpu-layers (llama.cpp) — number of layers on GPU
- max_seq_len (ExLlamaV2 / Transformers) — context length
- load-in-4bit / 8bit (Transformers) — bitsandbytes quantization
- cache_8bit / cache_4bit (ExLlamaV2) — KV cache quantization
- flash_attn (llama.cpp) — FlashAttention
You can also download models directly from HuggingFace via the Download model or LoRA field (paste a HF model ID).
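When running with --api, loading can also be scripted: the server exposes internal endpoints alongside the OpenAI-compatible ones. A sketch in Python (the /v1/internal/model/load route and its payload shape can differ between versions, so treat the field names as assumptions):
import requests

# Ask a running server (launched with --api) to swap models
resp = requests.post(
    "http://localhost:5000/v1/internal/model/load",
    json={
        "model_name": "llama-3.1-8b-instruct-Q5_K_M.gguf",
        "args": {"n_gpu_layers": 999},  # loader settings, same names as the UI fields
    },
    timeout=600,  # large models can take minutes to load
)
resp.raise_for_status()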
Sampler Presets {#samplers}
The Parameters tab has every sampler discussed in our LLM Sampling Parameters guide:
- temperature, top_k, top_p, min_p, typical_p
- smoothing_factor, smoothing_curve, dynamic temperature
- repetition_penalty, presence_penalty, frequency_penalty
- DRY: dry_multiplier, dry_base, dry_allowed_length
- XTC: xtc_threshold, xtc_probability
- mirostat (v1/v2)
- grammar (GBNF), logit bias
Saved presets ship in presets/. Defaults include:
- min_p — modern default with min_p truncation
- simple-1 — temp 0.7, top_p 0.9
- mirostat — adaptive
- divine_intellect — DRY-heavy creative
- midnight_enigma — XTC + DRY for fiction
Custom presets save to presets/ and appear in the dropdown next launch.
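Presets are plain YAML files with one sampler parameter per line. A hypothetical custom preset, using parameter names from the Parameters tab (values are illustrative, not recommendations):
# presets/my_creative.yaml
temperature: 1.0
min_p: 0.05
dry_multiplier: 0.8
dry_base: 1.75
xtc_threshold: 0.1
xtc_probability: 0.5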
Character Cards & Personas {#character-cards}
Character cards are JSON files describing a persona, behavior, and dialogue style. Place under characters/<name>.json or characters/<name>.png (V2 cards embed JSON in PNG metadata).
Example:
{
"name": "Ada",
"description": "Ada is a senior software engineer specializing in distributed systems. She is precise, direct, and prefers concrete examples to theory.",
"personality": "Concise, expert, slightly impatient with vague questions.",
"first_mes": "Hi. What are we debugging today?",
"mes_example": "<USER>: Why is my Postgres slow?\n<BOT>: Need more info. Can you share EXPLAIN ANALYZE output for the slow query?"
}
V2 character cards (with PNG art and richer metadata) are widely shared on Chub, Pygmalion's site, and Discord. Drag-and-drop the PNG into the Character gallery tab to import.
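If you want to see what a V2 card actually contains, the JSON is base64-encoded in a PNG text chunk (conventionally keyed chara). A quick inspection sketch with Pillow; the file path is hypothetical:
import base64, json
from PIL import Image

# Read the embedded character card from the PNG's text chunks
img = Image.open("characters/ada.png")
raw = img.info.get("chara")  # base64-encoded JSON per the V2 card convention
card = json.loads(base64.b64decode(raw))
# V2 cards nest fields under "data"; V1 cards put them at the top level
print(card["data"]["name"] if "data" in card else card["name"])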
The OpenAI-Compatible API Extension {#api-extension}
Launch with --api:
python server.py --api --listen --listen-port 7860 --api-port 5000
Now http://localhost:5000/v1/chat/completions works with any OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:5000/v1", api_key="any")
resp = client.chat.completions.create(
model="llama-3.1-8b",
messages=[{"role": "user", "content": "Hello"}],
temperature=0.7,
extra_body={"min_p": 0.05, "dry_multiplier": 0.8},
)
Authentication: launch with --api-key sk-...; clients then pass it in an Authorization: Bearer sk-... header.
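Streaming works exactly as with OpenAI's API. Combined with an API key:
from openai import OpenAI

# api_key must match the value passed to --api-key at launch
client = OpenAI(base_url="http://localhost:5000/v1", api_key="sk-...")
stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Write a haiku about VRAM."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)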
Built-In RAG: superboogav2 {#superbooga}
python server.py --extensions superboogav2
The extension uses ChromaDB as the local vector store and supports drag-and-drop document ingestion (PDF, DOCX, TXT, MD, HTML). Configure chunk size, embedding model (defaults to all-mpnet-base-v2), and similarity threshold from the UI.
For production-grade RAG with scaling, see Ollama ChromaDB RAG Pipeline and Vector Databases Comparison.
Multimodal: LLaVA, Llama 3.2 Vision {#multimodal}
python server.py --extensions multimodal --multimodal-pipeline llama32-vision
Drag images into the chat input. Compatible models: LLaVA 1.5 / 1.6, Llama 3.2 11B / 90B Vision, Qwen 2-VL, Pixtral 12B, MiniCPM-V.
Whisper STT and Silero TTS Extensions {#audio}
python server.py --extensions whisper_stt silero_tts
whisper_stt adds a microphone button (transcribes spoken input). silero_tts reads bot responses aloud (offline neural TTS, several voices). For dedicated audio-AI deployments see Whisper Local Speech-to-Text.
QLoRA Fine-Tuning in the Training Tab {#training}
Tab: Training → Train LoRA.
Setup:
- Load a Transformers model (NOT a quantized loader).
- Pass --load-in-4bit to enable QLoRA mode.
- Provide a dataset. Format: a JSON list of {"instruction": "...", "input": "...", "output": "..."} objects, or a raw text file (see the example after this list).
- Set hyperparameters:
  - LoRA Rank: 8-32 (32 for more capacity)
  - LoRA Alpha: typically 2x the rank
  - Learning Rate: 1e-4 to 3e-4
  - Batch Size: 1-4 (depends on VRAM)
  - Epochs: 3-5
- Click Start LoRA Training.
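A minimal instruction-format dataset file in the alpaca-style shape described above (the content is illustrative):
[
  {
    "instruction": "Summarize the following text in one sentence.",
    "input": "text-generation-webui is a Gradio-based UI that loads LLMs through several swappable backends.",
    "output": "It is a multi-loader Gradio UI for running local LLMs."
  },
  {
    "instruction": "Explain what a LoRA adapter is in one sentence.",
    "input": "",
    "output": "A LoRA adapter is a small set of low-rank weight updates trained on top of a frozen base model."
  }
]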
Output saves to loras/<name>/. Apply at inference via the Model tab → "LoRA(s)" dropdown.
For deeper training (DPO, full fine-tuning, larger models, multi-GPU) use Axolotl or Unsloth instead — see our LoRA Fine-Tuning Local Guide.
SillyTavern Integration {#sillytavern}
Launch oobabooga with --api:
python server.py --api --listen --listen-port 7860 --api-port 5000
In SillyTavern: API → "Text Completion" → API type "Default" → URL http://localhost:5000. Or use OpenAI-compatible API mode: API type "OpenAI" → URL http://localhost:5000/v1.
The latter exposes the OpenAI subset; the former exposes more samplers. Pick OpenAI for portability, Default for full sampler access.
Useful Extensions Beyond the Defaults {#extensions}
| Extension | Purpose |
|---|---|
| superboogav2 | Local RAG with ChromaDB |
| whisper_stt | Microphone speech-to-text |
| silero_tts | Offline neural text-to-speech |
| multimodal | Vision models (LLaVA family) |
| memoir+ | Persistent memory across sessions |
| long_replies | Bias toward longer outputs |
| character_bias | Steering vectors for personas |
| google_translate | Translate user / bot turns |
| coqui_tts | Coqui XTTS v2 voice cloning |
| web_search | DuckDuckGo / SearXNG integration |
| message_intercepter | Hook for log filtering |
Browse the extensions/ directory or the awesome-tgw community list. Enable extensions in settings.yaml:
default_extensions:
- openai
- superboogav2
- whisper_stt
Or via CLI: --extensions openai superboogav2.
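Writing your own extension means dropping a script.py into extensions/<name>/ and defining hook functions the server discovers by name. A minimal sketch (hook signatures have shifted across versions; compare against an extension shipped in extensions/ before relying on them):
# extensions/shout/script.py — toy extension that upper-cases every bot reply
params = {
    "display_name": "Shout",
    "is_tab": False,
}

def output_modifier(string, state, is_chat=False):
    # Called on each generated reply before it is displayed
    return string.upper()
Enable it the same way as any other extension: --extensions shout.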
Long Context Tuning {#long-context}
For long-context (32K-131K) inference:
- llama.cpp loader: set n_ctx to your target. Enable Flash Attention. Use a Q4 KV cache via cache-type-k q4_0 and cache-type-v q4_0.
- ExLlamaV2 loader: set max_seq_len and enable cache_4bit. See ExLlamaV2 + TabbyAPI Guide.
- Transformers loader: rope_theta is auto-loaded from the model config. For RoPE scaling, set rope_freq_base and rope_freq_scale in advanced settings.
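Translated into launch flags, a long-context llama.cpp session might be started like this (flag spellings vary between releases, so confirm with python server.py --help):
python server.py \
  --model llama-3.1-8b-instruct-Q5_K_M.gguf \
  --loader llama.cpp \
  --n_ctx 65536 \
  --flash-attn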
Tuning Recipes by GPU {#tuning}
RTX 4090 (24 GB) — ExLlamaV2
loader: ExLlamaV2_HF
model: Llama-3.1-70B-Instruct-exl2-2.25bpw  # a 4.0bpw 70B (~35 GB of weights) does not fit in 24 GB; ~2.25bpw does
max_seq_len: 16384
cache_4bit: true
RTX 3060 12 GB — llama.cpp
loader: llama.cpp
model: llama-3.1-8b-instruct-Q5_K_M.gguf
n_ctx: 16384
n_gpu_layers: 999
flash_attn: true
Mac M4 Max — llama.cpp Metal
loader: llama.cpp
model: llama-3.1-70b-instruct-Q4_K_M.gguf
n_ctx: 16384
n_gpu_layers: 999
flash_attn: true
CPU only
loader: llama.cpp
model: llama-3.2-3b-instruct-Q4_K_M.gguf
n_ctx: 8192
threads: 8
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| One-click installer fails | Network / proxy | Manual install with explicit --index-url |
| Model OOM at load | VRAM too tight | Lower max_seq_len, enable cache quant |
| ExLlamaV2 fails | flash-attn version | Upgrade flash-attn or disable in advanced |
| Garbled output | Wrong instruction template | Set explicit "Instruction template" in Parameters tab |
| Extension import error | Missing pip dep | Run extension's requirements.txt |
| API mode 404s | --api flag missing | Restart with --api |
| QLoRA training OOM | Batch size too high | Drop to batch_size=1, gradient_accumulation=8 |
| Slow on llama.cpp | n_gpu_layers wrong | Set to 999 to push everything to GPU |
Sources: text-generation-webui GitHub | Gradio docs | Internal benchmarks RTX 3090, 4090, 5090, M4 Max.