text-generation-webui (oobabooga) Complete Guide (2026): Setup, Loaders, Extensions
text-generation-webui — almost everyone calls it "oobabooga" after its maintainer — is the Swiss Army knife of local LLM UIs. It is the only major UI that supports multiple inference loaders behind one interface: Transformers, llama.cpp, ExLlamaV2, TensorRT-LLM, and HQQ. It ships built-in QLoRA fine-tuning, RAG via superboogav2, multimodal (vision) support, an OpenAI-compatible API, and a deep extension system. If you want to experiment broadly across model formats and capabilities without reinstalling tooling, this is the right starting point.
This guide covers everything: the one-click installer, picking the right loader for your model format, the Chat / Default / Notebook modes, the most useful extensions, QLoRA fine-tuning, OpenAI API mode, sampling presets, character cards, SillyTavern integration, and tuning recipes for common GPUs.
Table of Contents
- What text-generation-webui Is
- Loader Selection: When to Use Which
- Hardware Requirements
- Installation: One-Click and Manual
- The Three UI Modes (Chat, Default, Notebook)
- Loading Models
- Sampler Presets
- Character Cards & Personas
- The OpenAI-Compatible API Extension
- Built-In RAG: superboogav2
- Multimodal: LLaVA, Llama 3.2 Vision
- Whisper STT and Silero TTS Extensions
- QLoRA Fine-Tuning in the Training Tab
- SillyTavern Integration
- Useful Extensions Beyond the Defaults
- Long Context Tuning
- Tuning Recipes by GPU
- Troubleshooting
What text-generation-webui Is {#what-it-is}
A Gradio-based Python application that:
- Loads LLMs via 5+ swappable loaders (Transformers / llama.cpp / ExLlamaV2 / TensorRT-LLM / HQQ).
- Provides three UI modes: Chat, Default (single-turn), Notebook (free-form).
- Exposes every modern sampler including DRY, XTC, and Mirostat.
- Supports character cards, personas, and world info.
- Ships QLoRA fine-tuning in a Training tab.
- Has 50+ official + community extensions.
- Runs on Windows / Linux / macOS / Docker with a one-click installer.
Project: github.com/oobabooga/text-generation-webui. License: AGPL-3.0.
Loader Selection: When to Use Which {#loaders}
| Loader | Format | Speed | When |
|---|---|---|---|
| Transformers | HF FP16/BF16, AWQ, GPTQ | Slow | Quality testing; HF ecosystem |
| llama.cpp | GGUF | Fast | CPU / Mac / mixed GPU offload |
| ExLlamaV2 | EXL2 | Fastest single-GPU NVIDIA | RTX 30/40/50 single-user |
| TensorRT-LLM | TRT-LLM engines | Lowest latency | Production NVIDIA |
| HQQ | HQQ format | Fast | New experimental formats |
For consumer NVIDIA + creative writing: ExLlamaV2 + EXL2 quants. For broad model coverage / Mac / CPU: llama.cpp + GGUF. For experimentation across formats: keep the option to swap.
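You can also pin the loader at launch instead of picking it in the UI. A sketch (the --model and --loader flags exist in current releases, but verify against python server.py --help on your install):
# Force the llama.cpp loader for a GGUF model at startup
python server.py --model llama-3.1-8b-instruct-Q5_K_M.gguf --loader llama.cpp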
Hardware Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| GPU | None (CPU works for small models) | 12 GB VRAM+ for 8B models |
| RAM | 16 GB | 32 GB |
| Disk | 20 GB free | NVMe; models 4-50 GB each |
| Python | 3.10 (installer manages) | 3.11 |
| OS | Windows 10/11, Linux, macOS 12+ | Ubuntu 22.04 LTS |
For NVIDIA: CUDA 11.8+ driver. For AMD: ROCm 6.x or Vulkan. For Apple: Metal (auto). For Intel Arc: Vulkan or oneAPI.
Installation: One-Click and Manual {#installation}
One-click (recommended)
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
# Windows: double-click start_windows.bat
# Linux:
./start_linux.sh
# macOS:
./start_macos.sh
The installer asks for your GPU vendor on first run and configures everything. The UI launches at http://localhost:7860.
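With the one-click installer, persistent launch flags belong in CMD_FLAGS.txt at the repo root; the start scripts read it on every launch. The flags here are illustrative:
# CMD_FLAGS.txt
--api --listen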
Manual install
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
python3.11 -m venv venv
source venv/bin/activate
# Pick one PyTorch install
pip install torch --index-url https://download.pytorch.org/whl/cu124 # NVIDIA
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2 # AMD
pip install torch # Mac
# Install the rest
pip install -r requirements.txt
# Run
python server.py
Docker
docker run --gpus all -p 7860:7860 -p 5000:5000 \
-v $(pwd)/models:/app/models \
-v $(pwd)/loras:/app/loras \
-v $(pwd)/characters:/app/characters \
atinoda/text-generation-webui:latest
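If you prefer Compose, roughly the same setup as a docker-compose.yml sketch (assuming the same atinoda image and an NVIDIA GPU; adjust ports and volumes to taste):
services:
  text-generation-webui:
    image: atinoda/text-generation-webui:latest
    ports:
      - "7860:7860"   # web UI
      - "5000:5000"   # API
    volumes:
      - ./models:/app/models
      - ./loras:/app/loras
      - ./characters:/app/characters
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]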
The Three UI Modes (Chat, Default, Notebook) {#ui-modes}
- Chat — multi-turn chat with personas / characters. Most common.
- Default — single-turn instruction. Best for quick tests and benchmarking.
- Notebook — free-form text editor; you author both sides. Best for fiction.
Switch via the top tab. Each mode keeps its own conversation history.
Loading Models {#loading-models}
- Place model files under models/:
  - GGUF: a single .gguf file in models/<model-name>/.
  - EXL2: the entire repo folder under models/<model-name>/.
  - HuggingFace Transformers: full HF folder (config, tokenizer, weights) under models/<model-name>/.
- Open the Model tab.
- Pick the model from the dropdown, pick the loader (auto-detected for most formats).
- Click Load. The status bar shows the loader and VRAM use.
Common loader settings:
- n-gpu-layers (llama.cpp) — number of layers on GPU
- max_seq_len (ExLlamaV2 / Transformers) — context length
- load-in-4bit / 8bit (Transformers) — bitsandbytes quantization
- cache_8bit / cache_4bit (ExLlamaV2) — KV cache quantization
- flash_attn (llama.cpp) — FlashAttention
You can also download models directly from HuggingFace via the Download model or LoRA field (paste a HF model ID).
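When running with --api, loading can also be scripted: the server exposes internal endpoints alongside the OpenAI-compatible ones. A sketch in Python (the /v1/internal/model/load route and its payload shape can differ between versions, so treat the field names as assumptions):
import requests

# Ask a running server (launched with --api) to swap models
resp = requests.post(
    "http://localhost:5000/v1/internal/model/load",
    json={
        "model_name": "llama-3.1-8b-instruct-Q5_K_M.gguf",
        "args": {"n_gpu_layers": 999},  # loader settings, same names as the UI fields
    },
    timeout=600,  # large models can take minutes to load
)
resp.raise_for_status()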
Sampler Presets {#samplers}
The Parameters tab has every sampler discussed in our LLM Sampling Parameters guide:
- temperature, top_k, top_p, min_p, typical_p
- smoothing_factor, smoothing_curve, dynamic temperature
- repetition_penalty, presence_penalty, frequency_penalty
- DRY: dry_multiplier, dry_base, dry_allowed_length
- XTC: xtc_threshold, xtc_probability
- mirostat (v1/v2)
- grammar (GBNF), logit bias
Saved presets ship in presets/. Defaults include:
- min_p — modern default with min_p truncation
- simple-1 — temp 0.7, top_p 0.9
- mirostat — adaptive
- divine_intellect — DRY-heavy creative
- midnight_enigma — XTC + DRY for fiction
Custom presets save to presets/ and appear in the dropdown next launch.
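Presets are plain YAML files with one sampler parameter per line. A hypothetical custom preset, using parameter names from the Parameters tab (values are illustrative, not recommendations):
# presets/my_creative.yaml
temperature: 1.0
min_p: 0.05
dry_multiplier: 0.8
dry_base: 1.75
xtc_threshold: 0.1
xtc_probability: 0.5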
Character Cards & Personas {#character-cards}
Character cards are JSON files describing a persona, behavior, and dialogue style. Place under characters/<name>.json or characters/<name>.png (V2 cards embed JSON in PNG metadata).
Example:
{
"name": "Ada",
"description": "Ada is a senior software engineer specializing in distributed systems. She is precise, direct, and prefers concrete examples to theory.",
"personality": "Concise, expert, slightly impatient with vague questions.",
"first_mes": "Hi. What are we debugging today?",
"mes_example": "<USER>: Why is my Postgres slow?\n<BOT>: Need more info. Can you share EXPLAIN ANALYZE output for the slow query?"
}
V2 character cards (with PNG art and richer metadata) are widely shared on Chub, Pygmalion's site, and Discord. Drag-and-drop the PNG into the Character gallery tab to import.
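If you want to see what a V2 card actually contains, the JSON is base64-encoded in a PNG text chunk (conventionally keyed chara). A quick inspection sketch with Pillow; the file path is hypothetical:
import base64, json
from PIL import Image

# Read the embedded character card from the PNG's text chunks
img = Image.open("characters/ada.png")
raw = img.info.get("chara")  # base64-encoded JSON per the V2 card convention
card = json.loads(base64.b64decode(raw))
# V2 cards nest fields under "data"; V1 cards put them at the top level
print(card["data"]["name"] if "data" in card else card["name"])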
The OpenAI-Compatible API Extension {#api-extension}
Launch with --api:
python server.py --api --listen --listen-port 7860 --api-port 5000
Now http://localhost:5000/v1/chat/completions works with any OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:5000/v1", api_key="any")
resp = client.chat.completions.create(
model="llama-3.1-8b",
messages=[{"role": "user", "content": "Hello"}],
temperature=0.7,
extra_body={"min_p": 0.05, "dry_multiplier": 0.8},
)
Authentication: launch with --api-key sk-...; clients then pass it in an Authorization: Bearer sk-... header.
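Streaming works exactly as with OpenAI's API. Combined with an API key:
from openai import OpenAI

# api_key must match the value passed to --api-key at launch
client = OpenAI(base_url="http://localhost:5000/v1", api_key="sk-...")
stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Write a haiku about VRAM."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)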
Built-In RAG: superboogav2 {#superbooga}
python server.py --extensions superboogav2
The extension uses ChromaDB as the local vector store and supports drag-and-drop document ingestion (PDF, DOCX, TXT, MD, HTML). Configure chunk size, embedding model (defaults to all-mpnet-base-v2), and similarity threshold from the UI.
For production-grade RAG with scaling, see Ollama ChromaDB RAG Pipeline and Vector Databases Comparison.
Multimodal: LLaVA, Llama 3.2 Vision {#multimodal}
python server.py --extensions multimodal --multimodal-pipeline llama32-vision
Drag images into the chat input. Compatible models: LLaVA 1.5 / 1.6, Llama 3.2 11B / 90B Vision, Qwen 2-VL, Pixtral 12B, MiniCPM-V.
Whisper STT and Silero TTS Extensions {#audio}
python server.py --extensions whisper_stt silero_tts
whisper_stt adds a microphone button (transcribes spoken input). silero_tts reads bot responses aloud (offline neural TTS, several voices). For dedicated audio-AI deployments see Whisper Local Speech-to-Text.
QLoRA Fine-Tuning in the Training Tab {#training}
Tab: Training → Train LoRA.
Setup:
- Load a Transformers model (NOT a quantized loader).
- Pass --load-in-4bit to enable QLoRA mode.
- Provide a dataset. Format: a JSON list of {"instruction": "...", "input": "...", "output": "..."} objects, or a raw text file (see the example after this list).
- Set hyperparameters:
  - LoRA Rank: 8-32 (32 for more capacity)
  - LoRA Alpha: typically 2x the rank
  - Learning Rate: 1e-4 to 3e-4
  - Batch Size: 1-4 (depends on VRAM)
  - Epochs: 3-5
- Click Start LoRA Training.
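A minimal instruction-format dataset file in the alpaca-style shape described above (the content is illustrative):
[
  {
    "instruction": "Summarize the following text in one sentence.",
    "input": "text-generation-webui is a Gradio-based UI that loads LLMs through several swappable backends.",
    "output": "It is a multi-loader Gradio UI for running local LLMs."
  },
  {
    "instruction": "Explain what a LoRA adapter is in one sentence.",
    "input": "",
    "output": "A LoRA adapter is a small set of low-rank weight updates trained on top of a frozen base model."
  }
]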
Output saves to loras/<name>/. Apply at inference via the Model tab → "LoRA(s)" dropdown.
For deeper training (DPO, full fine-tuning, larger models, multi-GPU) use Axolotl or Unsloth instead — see our LoRA Fine-Tuning Local Guide.
SillyTavern Integration {#sillytavern}
Launch oobabooga with --api:
python server.py --api --listen --listen-port 7860 --api-port 5000
In SillyTavern: API → "Text Completion" → API type "Default" → URL http://localhost:5000. Or use OpenAI-compatible API mode: API type "OpenAI" → URL http://localhost:5000/v1.
The latter exposes the OpenAI subset; the former exposes more samplers. Pick OpenAI for portability, Default for full sampler access.
Useful Extensions Beyond the Defaults {#extensions}
| Extension | Purpose |
|---|---|
| superboogav2 | Local RAG with ChromaDB |
| whisper_stt | Microphone speech-to-text |
| silero_tts | Offline neural text-to-speech |
| multimodal | Vision models (LLaVA family) |
| memoir+ | Persistent memory across sessions |
| long_replies | Bias toward longer outputs |
| character_bias | Steering vectors for personas |
| google_translate | Translate user / bot turns |
| coqui_tts | Coqui XTTS v2 voice cloning |
| web_search | DuckDuckGo / SearXNG integration |
| message_intercepter | Hook for log filtering |
Browse the extensions/ directory or the awesome-tgw community list. Enable extensions in settings.yaml:
default_extensions:
- openai
- superboogav2
- whisper_stt
Or via CLI: --extensions openai superboogav2.
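Writing your own extension means dropping a script.py into extensions/<name>/ and defining hook functions the server discovers by name. A minimal sketch (hook signatures have shifted across versions; compare against an extension shipped in extensions/ before relying on them):
# extensions/shout/script.py — toy extension that upper-cases every bot reply
params = {
    "display_name": "Shout",
    "is_tab": False,
}

def output_modifier(string, state, is_chat=False):
    # Called on each generated reply before it is displayed
    return string.upper()
Enable it the same way as any other extension: --extensions shout.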
Long Context Tuning {#long-context}
For long-context (32K-131K) inference:
- llama.cpp loader: set n_ctx to your target. Enable Flash Attention. Use a Q4 KV cache via cache-type-k q4_0 and cache-type-v q4_0.
- ExLlamaV2 loader: set max_seq_len and enable cache_4bit. See ExLlamaV2 + TabbyAPI Guide.
- Transformers loader: rope_theta is auto-loaded from the model config. For RoPE scaling, set rope_freq_base and rope_freq_scale in advanced settings.
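Translated into launch flags, a long-context llama.cpp session might be started like this (flag spellings vary between releases, so confirm with python server.py --help):
python server.py \
  --model llama-3.1-8b-instruct-Q5_K_M.gguf \
  --loader llama.cpp \
  --n_ctx 65536 \
  --flash-attn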
Tuning Recipes by GPU {#tuning}
RTX 4090 (24 GB) — ExLlamaV2
loader: ExLlamaV2_HF
model: Llama-3.1-70B-Instruct-exl2-2.25bpw  # a 4.0bpw 70B (~35 GB of weights) does not fit in 24 GB; ~2.25bpw does
max_seq_len: 16384
cache_4bit: true
RTX 3060 12 GB — llama.cpp
loader: llama.cpp
model: llama-3.1-8b-instruct-Q5_K_M.gguf
n_ctx: 16384
n_gpu_layers: 999
flash_attn: true
Mac M4 Max — llama.cpp Metal
loader: llama.cpp
model: llama-3.1-70b-instruct-Q4_K_M.gguf
n_ctx: 16384
n_gpu_layers: 999
flash_attn: true
CPU only
loader: llama.cpp
model: llama-3.2-3b-instruct-Q4_K_M.gguf
n_ctx: 8192
threads: 8
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| One-click installer fails | Network / proxy | Manual install with explicit --index-url |
| Model OOM at load | VRAM too tight | Lower max_seq_len, enable cache quant |
| ExLlamaV2 fails | flash-attn version | Upgrade flash-attn or disable in advanced |
| Garbled output | Wrong instruction template | Set explicit "Instruction template" in Parameters tab |
| Extension import error | Missing pip dep | Run extension's requirements.txt |
| API mode 404s | --api flag missing | Restart with --api |
| QLoRA training OOM | Batch size too high | Drop to batch_size=1, gradient_accumulation=8 |
| Slow on llama.cpp | n_gpu_layers wrong | Set to 999 to push everything to GPU |
Sources: text-generation-webui GitHub | Gradio docs | Internal benchmarks RTX 3090, 4090, 5090, M4 Max.