
text-generation-webui (oobabooga) Complete Guide (2026): Setup, Loaders, Extensions

May 1, 2026
28 min read
LocalAimaster Research Team


text-generation-webui — almost everyone calls it "oobabooga" after its maintainer — is the Swiss Army knife of local LLM UIs. It is the only major UI that supports multiple inference loaders behind one interface: Transformers, llama.cpp, ExLlamaV2, TensorRT-LLM, and HQQ. It ships built-in QLoRA fine-tuning, RAG via superboogav2, multimodal (vision) support, an OpenAI-compatible API, and a deep extension system. If you want to experiment broadly across model formats and capabilities without reinstalling tooling, this is the right starting point.

This guide covers everything: the one-click installer, picking the right loader for your model format, the Chat / Default / Notebook modes, the most useful extensions, QLoRA fine-tuning, OpenAI API mode, sampling presets, character cards, SillyTavern integration, and tuning recipes for common GPUs.

Table of Contents

  1. What text-generation-webui Is
  2. Loader Selection: When to Use Which
  3. Hardware Requirements
  4. Installation: One-Click and Manual
  5. The Three UI Modes (Chat, Default, Notebook)
  6. Loading Models
  7. Sampler Presets
  8. Character Cards & Personas
  9. The OpenAI-Compatible API Extension
  10. Built-In RAG: superboogav2
  11. Multimodal: LLaVA, Llama 3.2 Vision
  12. Whisper STT and Silero TTS Extensions
  13. QLoRA Fine-Tuning in the Training Tab
  14. SillyTavern Integration
  15. Useful Extensions Beyond the Defaults
  16. Long Context Tuning
  17. Tuning Recipes by GPU
  18. Troubleshooting


What text-generation-webui Is {#what-it-is}

A Gradio-based Python application that:

  • Loads LLMs via 5+ swappable loaders (Transformers / llama.cpp / ExLlamaV2 / TensorRT-LLM / HQQ).
  • Provides three UI modes: Chat, Default (single-turn), Notebook (free-form).
  • Exposes every modern sampler including DRY, XTC, and Mirostat.
  • Supports character cards, personas, and world info.
  • Ships QLoRA fine-tuning in a Training tab.
  • Has 50+ official + community extensions.
  • Runs on Windows / Linux / macOS / Docker with a one-click installer.

Project: github.com/oobabooga/text-generation-webui. License: AGPL-3.0.


Loader Selection: When to Use Which {#loaders}

| Loader | Format | Speed | When |
| --- | --- | --- | --- |
| Transformers | HF FP16/BF16, AWQ, GPTQ | Slow | Quality testing; HF ecosystem |
| llama.cpp | GGUF | Fast | CPU / Mac / mixed GPU offload |
| ExLlamaV2 | EXL2 | Fastest single-GPU NVIDIA | RTX 30/40/50 single-user |
| TensorRT-LLM | TRT-LLM engines | Lowest latency | Production NVIDIA |
| HQQ | HQQ format | Fast | New experimental formats |

For consumer NVIDIA + creative writing: ExLlamaV2 + EXL2 quants. For broad model coverage / Mac / CPU: llama.cpp + GGUF. For experimentation across formats: keep the option to swap.


Hardware Requirements {#requirements}

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | None (CPU works for small models) | 12 GB+ VRAM for 8B models |
| RAM | 16 GB | 32 GB |
| Disk | 20 GB free | NVMe; models 4-50 GB each |
| Python | 3.10 (installer manages it) | 3.11 |
| OS | Windows 10/11, Linux, macOS 12+ | Ubuntu 22.04 LTS |

For NVIDIA: CUDA 11.8+ driver. For AMD: ROCm 6.x or Vulkan. For Apple: Metal (auto). For Intel Arc: Vulkan or oneAPI.



Installation: One-Click and Manual {#installation}

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui

# Windows: double-click start_windows.bat
# Linux:
./start_linux.sh
# macOS:
./start_macos.sh

The installer asks for your GPU vendor on first run and configures everything. UI launches at http://localhost:7860.
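
The start scripts also read persistent launch flags from CMD_FLAGS.txt in the repo root, so your settings survive updates without editing the scripts. A minimal example (the flags shown are illustrative):

# CMD_FLAGS.txt: applied on every launch by the start_* scripts
--listen --api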

Manual install

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
python3.11 -m venv venv
source venv/bin/activate

# Pick one PyTorch install
pip install torch --index-url https://download.pytorch.org/whl/cu124       # NVIDIA
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2     # AMD
pip install torch                                                            # Mac

# Install the rest
pip install -r requirements.txt

# Run
python server.py

Docker

docker run --gpus all -p 7860:7860 -p 5000:5000 \
    -v $(pwd)/models:/app/models \
    -v $(pwd)/loras:/app/loras \
    -v $(pwd)/characters:/app/characters \
    atinoda/text-generation-webui:latest
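
If you prefer Compose, a roughly equivalent docker-compose.yml sketch (same atinoda image; the GPU stanza assumes the NVIDIA Container Toolkit, and the volume paths mirror the run command above):

services:
  text-generation-webui:
    image: atinoda/text-generation-webui:latest
    ports:
      - "7860:7860"   # web UI
      - "5000:5000"   # OpenAI-compatible API
    volumes:
      - ./models:/app/models
      - ./loras:/app/loras
      - ./characters:/app/characters
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Start it with docker compose up -d.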

The Three UI Modes (Chat, Default, Notebook) {#ui-modes}

  • Chat — multi-turn chat with personas / characters. Most common.
  • Default — single-turn instruction. Best for quick tests and benchmarking.
  • Notebook — free-form text editor; you author both sides. Best for fiction.

Switch via the top tab. Each mode keeps its own conversation history.


Loading Models {#loading-models}

  1. Place model files under models/.
    • GGUF: a single .gguf file in models/<model-name>/.
    • EXL2: the entire repo folder under models/<model-name>/.
    • HuggingFace Transformers: full HF folder (config, tokenizer, weights) under models/<model-name>/.
  2. Open the Model tab.
  3. Pick the model from the dropdown, pick the loader (auto-detected for most formats).
  4. Click Load. The status bar shows the loader and VRAM use.

Common loader settings (a CLI sketch follows the list):

  • n-gpu-layers (llama.cpp) — number of layers on GPU
  • max_seq_len (ExLlamaV2 / Transformers) — context length
  • load-in-4bit / 8bit (Transformers) — bitsandbytes quantization
  • cache_8bit / cache_4bit (ExLlamaV2) — KV cache quantization
  • flash_attn (llama.cpp) — FlashAttention
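
The same settings can be passed as CLI flags so the model loads at startup. A sketch for the llama.cpp loader; flag spellings have shifted between releases (e.g. n-gpu-layers vs gpu-layers), so confirm yours with python server.py --help:

python server.py \
    --model llama-3.1-8b-instruct-Q5_K_M.gguf \
    --loader llama.cpp \
    --n-gpu-layers 999 \
    --n_ctx 16384 \
    --flash-attn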

You can also download models directly from HuggingFace via the Download model or LoRA field (paste a HF model ID).
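
The same download works headlessly via the bundled download-model.py script (the model ID below is just an example; run it with --help for options such as branch selection):

python download-model.py microsoft/Phi-3-mini-4k-instruct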


Sampler Presets {#samplers}

The Parameters tab has every sampler discussed in our LLM Sampling Parameters guide:

  • temperature, top_k, top_p, min_p, typical_p
  • smoothing_factor, smoothing_curve, dynamic temperature
  • repetition_penalty, presence_penalty, frequency_penalty
  • DRY: dry_multiplier, dry_base, dry_allowed_length
  • XTC: xtc_threshold, xtc_probability
  • mirostat (v1/v2)
  • grammar (GBNF), logit bias

Saved presets ship in presets/. Defaults include:

  • min_p — modern default with min_p truncation
  • simple-1 — temp 0.7, top_p 0.9
  • mirostat — adaptive
  • divine_intellect — DRY-heavy creative
  • midnight_enigma — XTC + DRY for fiction

Custom presets save to presets/ and appear in the dropdown next launch.
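
Preset files are plain YAML, one sampler parameter per line; anything omitted falls back to its default. A hedged sketch of a custom preset (file name and values are illustrative; key names should match the Parameters tab):

# presets/my_creative.yaml
temperature: 0.9
min_p: 0.05
repetition_penalty: 1.05
dry_multiplier: 0.8
dry_base: 1.75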


Character Cards & Personas {#character-cards}

Character cards are JSON files describing a persona, behavior, and dialogue style. Place under characters/<name>.json or characters/<name>.png (V2 cards embed JSON in PNG metadata).

Example:

{
  "name": "Ada",
  "description": "Ada is a senior software engineer specializing in distributed systems. She is precise, direct, and prefers concrete examples to theory.",
  "personality": "Concise, expert, slightly impatient with vague questions.",
  "first_mes": "Hi. What are we debugging today?",
  "mes_example": "<USER>: Why is my Postgres slow?\n<BOT>: Need more info. Can you share EXPLAIN ANALYZE output for the slow query?"
}

V2 character cards (with PNG art and richer metadata) are widely shared on Chub, Pygmalion's site, and Discord. Drag-and-drop the PNG into the Character gallery tab to import.


The OpenAI-Compatible API Extension {#api-extension}

Launch with --api:

python server.py --api --listen --listen-port 7860 --api-port 5000

Now http://localhost:5000/v1/chat/completions works with any OpenAI client:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:5000/v1", api_key="any")

resp = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    extra_body={"min_p": 0.05, "dry_multiplier": 0.8},
)

Authentication: launch with --api-key sk-... and send the key in an Authorization: Bearer sk-... header.


Built-In RAG: superboogav2 {#superbooga}

python server.py --extensions superboogav2

The extension uses ChromaDB as the local vector store and supports drag-and-drop document ingestion (PDF, DOCX, TXT, MD, HTML). Configure chunk size, embedding model (defaults to all-mpnet-base-v2), and similarity threshold from the UI.

For production-grade RAG with scaling, see Ollama ChromaDB RAG Pipeline and Vector Databases Comparison.


Multimodal: LLaVA, Llama 3.2 Vision {#multimodal}

python server.py --extensions multimodal --multimodal-pipeline llama32-vision

Drag images into the chat input. Compatible models: LLaVA 1.5 / 1.6, Llama 3.2 11B / 90B Vision, Qwen2-VL, Pixtral 12B, MiniCPM-V.
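
Images can also be sent programmatically when the API extension is loaded alongside multimodal. A sketch assuming the endpoint accepts OpenAI-style image_url content parts (support varies by pipeline and release, so treat this as a starting point):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="any")

# Encode a local image as a base64 data URL
with open("diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="llava-1.5",  # placeholder; the currently loaded model is used regardless
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)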


Whisper STT and Silero TTS Extensions {#audio}

python server.py --extensions whisper_stt silero_tts

whisper_stt adds a microphone button (transcribes spoken input). silero_tts reads bot responses aloud (offline neural TTS, several voices). For dedicated audio-AI deployments see Whisper Local Speech-to-Text.


QLoRA Fine-Tuning in the Training Tab {#training}

Tab: Training → Train LoRA.

Setup:

  1. Load the model with the Transformers loader (not llama.cpp or ExLlamaV2; training requires Transformers).
  2. Pass --load-in-4bit to enable QLoRA mode.
  3. Provide a dataset (see the example after this list):
    • Format: a JSON list of {"instruction": "...", "input": "...", "output": "..."} objects, or a raw text file.
  4. Set hyperparameters:
    • LoRA Rank: 8-32 (32 for more capacity)
    • LoRA Alpha: 2x rank typically
    • Learning Rate: 1e-4 to 3e-4
    • Batch Size: 1-4 (depends on VRAM)
    • Epochs: 3-5
  5. Click Start LoRA Training.
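
A minimal instruction-format dataset file, with illustrative contents (the keys must match the format template you pick in the Training tab; input may be empty):

[
  {
    "instruction": "Summarize the following text in one sentence.",
    "input": "text-generation-webui drives several inference backends from a single interface.",
    "output": "It is one UI that can load models through multiple interchangeable loaders."
  },
  {
    "instruction": "Which loader should I use for GGUF files?",
    "input": "",
    "output": "Use the llama.cpp loader for GGUF models."
  }
]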

Output saves to loras/<name>/. Apply at inference via the Model tab → "LoRA(s)" dropdown.

For deeper training (DPO, full fine-tuning, larger models, multi-GPU) use Axolotl or Unsloth instead — see our LoRA Fine-Tuning Local Guide.


SillyTavern Integration {#sillytavern}

Launch oobabooga with --api:

python server.py --api --listen --listen-port 7860 --api-port 5000

In SillyTavern: API → "Text Completion" → API type "Default" → URL http://localhost:5000. Or use OpenAI-compatible API mode: API type "OpenAI" → URL http://localhost:5000/v1.

The latter exposes the OpenAI subset; the former exposes more samplers. Pick OpenAI for portability, Default for full sampler access.


Useful Extensions Beyond the Defaults {#extensions}

| Extension | Purpose |
| --- | --- |
| superboogav2 | Local RAG with ChromaDB |
| whisper_stt | Microphone speech-to-text |
| silero_tts | Offline neural text-to-speech |
| multimodal | Vision models (LLaVA family) |
| memoir+ | Persistent memory across sessions |
| long_replies | Bias toward longer outputs |
| character_bias | Steering vectors for personas |
| google_translate | Translate user / bot turns |
| coqui_tts | Coqui XTTS v2 voice cloning |
| web_search | DuckDuckGo / SearXNG integration |
| message_intercepter | Hook for log filtering |

Browse the extensions/ directory or the community-maintained awesome-tgw list. Enable extensions in settings.yaml:

default_extensions:
  - openai
  - superboogav2
  - whisper_stt

Or via CLI: --extensions openai superboogav2.
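
Writing your own extension is a single script.py dropped into extensions/<name>/. A minimal sketch based on the hook names in the project's extension docs (verify signatures against the wiki for your version):

# extensions/shout/script.py: toy extension that upper-cases bot replies
params = {
    "display_name": "Shout",
    "is_tab": False,
}

def input_modifier(string, state, is_chat=False):
    # Runs on the user's input before it reaches the model
    return string

def output_modifier(string, state, is_chat=False):
    # Runs on the model's output before it is displayed
    return string.upper()

Enable it like any other extension: --extensions shout.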


Long Context Tuning {#long-context}

For long-context (32K-131K) inference:

  • llama.cpp loader: set n_ctx to your target. Enable Flash Attention. Use Q4 KV cache via cache-type-k q4_0 cache-type-v q4_0 (see the sizing sketch after this list).
  • ExLlamaV2 loader: set max_seq_len and cache_4bit to True. See ExLlamaV2 + TabbyAPI Guide.
  • Transformers loader: rope_theta is auto-loaded from config. For RoPE scaling, set rope_freq_base and rope_freq_scale in advanced settings.
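
To see why cache quantization matters at these lengths, a back-of-envelope sizing sketch (figures assume Llama 3.1 8B: 32 layers, 8 KV heads, head dim 128; quantization scale overhead is ignored):

# KV cache bytes = 2 (K and V) * layers * context * kv_heads * head_dim * bytes/element
layers, kv_heads, head_dim = 32, 8, 128   # Llama 3.1 8B (GQA)
ctx = 65536                               # 64K context

for name, bytes_per_elem in [("FP16", 2.0), ("Q8", 1.0), ("Q4", 0.5)]:
    gib = 2 * layers * ctx * kv_heads * head_dim * bytes_per_elem / 2**30
    print(f"{name}: {gib:.1f} GiB")       # FP16: 8.0, Q8: 4.0, Q4: 2.0

At FP16 the cache alone would eat 8 GiB of a 12 GB card; Q4 brings it down to 2 GiB.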

Tuning Recipes by GPU {#tuning}

RTX 4090 (24 GB) — ExLlamaV2

loader: ExLlamaV2_HF
model: Llama-3.1-70B-Instruct-exl2-2.25bpw
max_seq_len: 16384
cache_4bit: true

RTX 3060 12 GB — llama.cpp

loader: llama.cpp
model: llama-3.1-8b-instruct-Q5_K_M.gguf
n_ctx: 16384
n_gpu_layers: 999
flash_attn: true

Mac M4 Max — llama.cpp Metal

loader: llama.cpp
model: llama-3.1-70b-instruct-Q4_K_M.gguf
n_ctx: 16384
n_gpu_layers: 999
flash_attn: true

CPU only

loader: llama.cpp
model: llama-3.2-3b-instruct-Q4_K_M.gguf
n_ctx: 8192
threads: 8

Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
| --- | --- | --- |
| One-click installer fails | Network / proxy | Manual install with explicit --index-url |
| Model OOM at load | VRAM too tight | Lower max_seq_len, enable cache quant |
| ExLlamaV2 fails | flash-attn version | Upgrade flash-attn or disable in advanced |
| Garbled output | Wrong instruction template | Set explicit "Instruction template" in Parameters tab |
| Extension import error | Missing pip dep | Install the extension's requirements.txt |
| API mode 404s | --api flag missing | Restart with --api |
| QLoRA training OOM | Batch size too high | Drop to batch_size=1, gradient_accumulation=8 |
| Slow on llama.cpp | n_gpu_layers wrong | Set to 999 to push everything to GPU |



Sources: text-generation-webui GitHub | Gradio docs | Internal benchmarks (RTX 3090, 4090, 5090, M4 Max).
