KoboldCpp Setup Guide (2026): One-Binary Local LLM Server with UI
KoboldCpp is the all-in-one local AI appliance — a single executable that runs your LLM, ships a usable web UI, speaks the OpenAI API, generates images via Stable Diffusion, transcribes audio with Whisper, synthesizes speech with OuteTTS, exposes embeddings, and supports every modern sampler (including DRY and XTC). No install, no Python, no Docker required. Download one file and run it.
This guide covers everything: installation across Windows / Linux / macOS, GPU acceleration paths (CUDA, Vulkan, Metal, CPU), the Kobold Lite UI, the OpenAI-compatible API, multimodal pipelines, SillyTavern integration, sampler presets for chat / code / roleplay / creative writing, and tuning for common GPUs.
Table of Contents
- What KoboldCpp Is
- Why It Beats Ollama for Some Workloads
- Hardware Requirements
- Installation: Windows, Linux, macOS
- Your First Run
- GPU Acceleration: CUDA, Vulkan, Metal, CPU
- Picking GGUF Quantization
- Kobold Lite Web UI
- OpenAI-Compatible API
- Sampler Reference (with DRY and XTC)
- Context Shifting and Long Context
- Speculative Decoding (Draft Models)
- Vision Models (LLaVA, Llama 3.2 Vision, Qwen 2 VL)
- Stable Diffusion Image Generation
- Whisper Speech-to-Text
- Text-to-Speech (OuteTTS)
- Embeddings Server
- SillyTavern Integration
- Multi-User and Concurrency
- Tuning Recipes by GPU
- Troubleshooting
- FAQ
What KoboldCpp Is {#what-it-is}
KoboldCpp (LostRuins/koboldcpp on GitHub) is a fork of llama.cpp packaged as a single executable. It includes:
- llama.cpp inference engine (same as Ollama under the hood)
- Kobold Lite — a web chat / story / roleplay UI
- Three APIs: KoboldAI legacy, OpenAI-compatible, and SSE streaming
- Stable Diffusion integration (txt2img, img2img, inpainting)
- Whisper speech-to-text
- OuteTTS text-to-speech
- CLIP / vision encoder support for multimodal models
- Embeddings endpoint compatible with OpenAI /v1/embeddings
All in one ~400 MB binary that needs no Python, no install, no virtualenv.
Why It Beats Ollama for Some Workloads {#vs-ollama}
| Need | Better Choice |
|---|---|
| Roleplay / creative writing | KoboldCpp (DRY, XTC, mirostat, smoothing) |
| Multimodal in one binary | KoboldCpp (LLM + SD + Whisper + TTS) |
| Auto model management, simple CLI | Ollama |
| Modelfile customization | Ollama |
| Portable, no install | KoboldCpp |
| Fine-grained sampler control | KoboldCpp |
| Multi-user concurrent serving | Neither — use vLLM |
| Production deployment | Ollama (better systemd / Docker support) |
For a creative-writing single-user stack, KoboldCpp + SillyTavern is the de facto choice in 2026.
Hardware Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| OS | Windows 10, Linux (any modern), macOS 12+ | Same |
| RAM | 8 GB | 16-32 GB for 8B-14B models |
| GPU VRAM (NVIDIA / AMD / Intel) | 4 GB (Q4 1B-3B model) | 12 GB+ |
| CPU | 64-bit | 8-core or better for partial offload |
| Disk | 5 GB free | NVMe; models 4-50 GB each |
| GPU API | None (CPU works) | CUDA 11.4+, Vulkan 1.2+, Metal |
KoboldCpp has the broadest hardware compatibility of any local LLM server — even integrated GPUs and 10-year-old CPUs run small models.
Installation: Windows, Linux, macOS {#installation}
Windows
Download koboldcpp.exe from github.com/LostRuins/koboldcpp/releases. That's it — no install. Double-click to launch the GUI loader, or run from CLI.
Linux
# Pre-built binary
wget https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-linux-x64-cuda1210
chmod +x koboldcpp-linux-x64-cuda1210
./koboldcpp-linux-x64-cuda1210 --help
# Or build from source for ROCm / latest features
git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
make LLAMA_HIPBLAS=1 # AMD ROCm
# or:
make LLAMA_CUBLAS=1 # NVIDIA CUDA
# or:
make LLAMA_VULKAN=1 # Vulkan (any GPU)
macOS
# Pre-built ARM binary
wget https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-mac-arm64
chmod +x koboldcpp-mac-arm64
./koboldcpp-mac-arm64 --help
Metal acceleration is built in.
Docker (community-maintained)
docker run --gpus all -p 5001:5001 \
-v $(pwd)/models:/models \
koboldai/koboldcpp:latest \
--model /models/llama-3.1-8b.gguf --port 5001 --host 0.0.0.0
Your First Run {#first-run}
# Download a model
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
# Launch with sensible defaults
./koboldcpp \
--model Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--usecublas \
--gpulayers 999 \
--contextsize 8192 \
--port 5001
Open http://localhost:5001 in your browser. The Kobold Lite UI loads.
Windows users who prefer a GUI can simply double-click koboldcpp.exe; a dialog asks for the model file, the GPU acceleration backend, and the context size.
GPU Acceleration: CUDA, Vulkan, Metal, CPU {#acceleration}
| Flag | Backend | Hardware |
|---|---|---|
| --usecublas | CUDA | NVIDIA |
| --usevulkan | Vulkan | NVIDIA, AMD, Intel, Apple (older) |
| --useclblast | CLBlast (deprecated) | Older AMD, Intel |
| --usemetal | Metal | Apple Silicon (auto on Mac) |
| --usecpu | CPU only | Anything |
| --gpulayers N | Layers on GPU (999 = all) | Any GPU backend |
| --tensorsplit 1,1 | Split across multi-GPU | Multi-GPU rigs |
Most users want --usecublas (NVIDIA) or --usevulkan (AMD / Intel / mixed). Metal is auto-detected on Apple Silicon Macs.
For partial GPU offload on a card too small for the full model: set --gpulayers to a number lower than the model's layer count. See CUDA Optimization for layer counts per popular model.
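For example, a hedged sketch of a partial-offload launch on a small card; the layer and thread counts here are illustrative, not tuned values:
# Partial offload for a GPU too small to hold the whole model.
# KoboldCpp prints the model's total layer count at load time; raise --gpulayers
# until VRAM is nearly full, leaving headroom for the KV cache.
./koboldcpp \
  --model Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --usecublas \
  --gpulayers 16 \
  --contextsize 4096 \
  --threads 8    # CPU threads run the layers that stay off the GPU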
Picking GGUF Quantization {#quantization}
| Quant | Bits | Size (8B) | Quality | When to use |
|---|---|---|---|---|
| Q2_K | ~2.5 | 3.2 GB | Bad | Last resort, very tight VRAM |
| Q3_K_M | ~3.5 | 4.0 GB | Mediocre | Tight VRAM, prefer IQ3_XXS instead |
| IQ3_XXS | ~3.0 | 3.5 GB | Better than Q3_K_M | Tight VRAM |
| Q4_K_M | ~4.6 | 4.9 GB | Standard | Default for most |
| IQ4_XS | ~4.3 | 4.6 GB | Slightly better than Q4_K_M | Tight VRAM with quality |
| Q5_K_M | ~5.7 | 5.7 GB | High quality | Plenty of VRAM |
| Q6_K | ~6.6 | 6.6 GB | Near-lossless | Quality-sensitive |
| Q8_0 | 8.5 | 8.5 GB | Essentially lossless | Maximum quality |
For Llama 3.1 70B (24 GB RTX 4090 partial-offload reality):
- Q4_K_M = 42 GB — partial offload ~8 tok/s
- IQ3_XXS = 30 GB — partial offload ~12 tok/s with better quality than Q3_K_M
- Q2_K = 25 GB — fits with offload but quality cliff
For 70B fully on a single 24GB GPU, use ExLlamaV2 instead.
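To grab a specific quant, download the matching file from the model's GGUF repo. The URL below follows the bartowski naming pattern used earlier and is illustrative; check the repo's file list for the exact filename. As a rough rule, budget the file size plus 1-2 GB for KV cache and overhead.
# Download a specific quant (illustrative URL; confirm the filename in the repo's file list)
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
# Sanity check: file size plus ~1-2 GB of KV/overhead should fit your VRAM budget
ls -lh Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf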
Kobold Lite Web UI {#ui}
The built-in UI at http://localhost:5001 has four modes:
- Story — long-form continuation, no chat formatting.
- Adventure — text adventure / interactive fiction.
- Chat — chat with personas; supports world info / character cards.
- Instruct — single-turn instruction following.
Sidebar features:
- AI Settings — every sampler exposed
- Memory / Author's Note / World Info — persistent context injection
- Image generation — if Stable Diffusion is loaded
- Voice — if Whisper / TTS are loaded
- Saved presets — Balanced, Creative, Precise, plus custom
For roleplay-specific features (character cards, lorebooks, advanced UI), pair with SillyTavern instead — see SillyTavern Integration.
OpenAI-Compatible API {#openai-api}
Endpoints mirror OpenAI plus KoboldAI legacy:
| Endpoint | Purpose |
|---|---|
| POST /v1/chat/completions | OpenAI chat |
| POST /v1/completions | OpenAI text completion |
| POST /v1/embeddings | Embeddings |
| POST /sdapi/v1/txt2img | Stable Diffusion |
| POST /api/extra/transcribe | Whisper STT |
| POST /api/extra/tts | Text-to-speech |
| GET /api/v1/model | Model info |
| GET /api/extra/version | Server info |
Example:
curl http://localhost:5001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b",
"messages": [{"role":"user","content":"Hello"}],
"temperature": 0.7,
"min_p": 0.05,
"dry_multiplier": 0.8
}'
Note the non-OpenAI fields (min_p, dry_multiplier) — KoboldCpp accepts the full sampler stack as extension fields.
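Streaming goes through the same endpoint; a minimal sketch, assuming the standard OpenAI-style stream flag with SSE output (curl's -N disables buffering so chunks print as they arrive):
curl -N http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role":"user","content":"Write a haiku about GPUs"}],
    "stream": true
  }'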
Sampler Reference (with DRY and XTC) {#samplers}
KoboldCpp supports every sampler in our LLM Sampling Parameters guide. Key flags / API fields:
{
"temperature": 0.7,
"top_k": 0,
"top_p": 0.9,
"min_p": 0.05,
"typical_p": 1.0,
"smoothing_factor": 0.0,
"smoothing_curve": 1.0,
"rep_pen": 1.05,
"rep_pen_range": 1024,
"rep_pen_slope": 0.7,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"dynatemp_range": 0.0,
"dynatemp_exponent": 1.0,
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.1,
"dry_multiplier": 0.8,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_penalty_last_n": 0,
"dry_sequence_breakers": ["\n", ":", "\"", "*"],
"xtc_threshold": 0.1,
"xtc_probability": 0.5,
"grammar": "..."
}
Built-in UI presets:
- Balanced — temp 0.7, min_p 0.05, rep_pen 1.05
- Creative — temp 1.1, min_p 0.05, dry 0.8, xtc 0.5/0.1
- Precise — temp 0.3, min_p 0.05, no XTC, no DRY
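If you drive KoboldCpp from the API instead of the UI, the Creative preset above translates roughly to these request fields; a sketch using the sampler names listed earlier, with values copied from the preset description:
curl http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role":"user","content":"Continue this story: the lighthouse went dark."}],
    "temperature": 1.1,
    "min_p": 0.05,
    "dry_multiplier": 0.8,
    "xtc_probability": 0.5,
    "xtc_threshold": 0.1
  }'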
Context Shifting and Long Context {#context-shift}
KoboldCpp's "Context Shifting" feature reuses the KV cache when a conversation grows past the configured context size: instead of re-prefilling the whole prompt, it drops the oldest tokens and shifts the remaining KV cache forward. This makes long-running chat sessions much faster than vanilla llama.cpp.
./koboldcpp \
--model model.gguf \
--contextsize 32768 \
--noshift # disable context shifting (rarely needed)
For RoPE-scaled extension on models that need it (older Llama 2 to 16K, etc.):
./koboldcpp \
--ropeconfig 0.5 32000 # rope_freq_scale rope_freq_base
Speculative Decoding (Draft Models) {#speculative}
./koboldcpp \
--model llama-3.1-70b.Q4_K_M.gguf \
--draftmodel llama-3.2-1b.Q4_K_M.gguf \
--draftgpulayers 999 \
--draftamount 8
Pair models with the same tokenizer (Llama 3.1 70B + Llama 3.2 1B = same vocab). Expected speedup: 1.5-2.0x at single-user batch size 1. See CUDA Optimization for theory.
Vision Models (LLaVA, Llama 3.2 Vision, Qwen 2 VL) {#vision}
./koboldcpp \
--model llama-3.2-11b-vision.Q4_K_M.gguf \
--mmproj llama-3.2-11b-vision-mmproj.gguf \
--usecublas --gpulayers 999
The --mmproj flag points to the multimodal projector file (CLIP-style image encoder). In Kobold Lite chat mode, drag-and-drop images into the conversation.
Compatible vision models: Llama 3.2 11B / 90B Vision, Qwen 2 VL, LLaVA-1.5 / 1.6, MiniCPM-V, Pixtral.
Stable Diffusion Image Generation {#image-gen}
./koboldcpp \
--model llama-3.1-8b.gguf \
--sdmodel sdxl-base-1.0.safetensors \
--sdt5xxl t5xxl_fp16.safetensors # for SD 3.5 / Flux
In Kobold Lite chat, type /img a cyberpunk samurai and the model generates an image inline.
API endpoint:
curl http://localhost:5001/sdapi/v1/txt2img \
-H "Content-Type: application/json" \
-d '{"prompt": "a cat", "steps": 25, "cfg_scale": 7.0}'
Supports SD 1.5, SDXL, SD 3.5, Flux Schnell. Memory budget: SD models share VRAM with the LLM, so plan accordingly. For dedicated image-gen workloads use ComfyUI instead.
Whisper Speech-to-Text {#whisper}
./koboldcpp --model llm.gguf --whispermodel ggml-medium.bin
API:
curl http://localhost:5001/api/extra/transcribe \
-F "file=@audio.wav" -F "model=whisper"
Whisper models: tiny / base / small / medium / large-v3. medium is the sweet spot for English; large-v3 for accuracy or non-English. See Whisper Local Speech-to-Text for the standalone Whisper guide.
Text-to-Speech (OuteTTS) {#tts}
./koboldcpp --model llm.gguf --ttsmodel oute-tts-0.3-1b.gguf
OuteTTS produces natural English voice. Voice cloning via --ttsvoice <reference.wav>. Quality is roughly equivalent to commercial TTS systems for English; non-English is supported but lower quality.
Embeddings Server {#embeddings}
./koboldcpp --model llm.gguf --embeddingsmodel nomic-embed-text-v1.5.Q5_K_M.gguf
OpenAI-compatible /v1/embeddings endpoint. Compatible with Nomic Embed, BGE, GTE, E5, and most modern embedding GGUF models.
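A quick smoke test of the endpoint; the request body follows the usual OpenAI embeddings shape, and the model name is typically just informational for a local server:
curl http://localhost:5001/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text-v1.5", "input": "KoboldCpp also serves embeddings"}'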
SillyTavern Integration {#sillytavern}
SillyTavern is the most-used roleplay / character-chat frontend; KoboldCpp is its preferred backend.
- Launch KoboldCpp: ./koboldcpp --model X.gguf --port 5001 --host 0.0.0.0
- Install SillyTavern (https://docs.sillytavern.app/installation/)
- In SillyTavern: API → "Text Completion" → API type "KoboldCpp" → URL http://localhost:5001
- Tick "Streaming"
- Save and connect
SillyTavern uses KoboldCpp's full sampler stack and gives you persistent character cards, world info, lorebooks, group chats, and per-character preset overrides.
Multi-User and Concurrency {#multi-user}
KoboldCpp supports parallel processing with --multiuser N:
./koboldcpp --model llm.gguf --multiuser 4 --contextsize 8192
This enables up to 4 concurrent generations sharing the model. Throughput is much lower than vLLM (no PagedAttention), so for >4 concurrent users use vLLM. For 1-4 users on a desktop, KoboldCpp is fine.
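To see the concurrency in action, fire a few requests in parallel; a throwaway sketch with arbitrary prompts and token counts:
# Launch four requests at once against a --multiuser 4 server
for i in 1 2 3 4; do
  curl -s http://localhost:5001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Request '"$i"': say hello", "max_tokens": 32}' &
done
wait    # all four generations run concurrently and return as they finish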
Tuning Recipes by GPU {#tuning}
RTX 3060 12 GB
./koboldcpp \
--model llama-3.1-8b-Q5_K_M.gguf \
--usecublas --gpulayers 999 \
--contextsize 16384 \
--flashattention \
--quantkv 8
RTX 4090 24 GB
./koboldcpp \
--model llama-3.1-8b-Q5_K_M.gguf \
--usecublas --gpulayers 999 \
--contextsize 32768 \
--flashattention \
--quantkv 8
For 70B partial offload:
./koboldcpp \
--model llama-3.1-70b-IQ3_XXS.gguf \
--usecublas --gpulayers 65 \
--contextsize 8192 \
--flashattention \
--quantkv 4
Apple M4 Max
./koboldcpp \
--model llama-3.1-70b-Q4_K_M.gguf \
--usemetal --gpulayers 999 \
--contextsize 16384 \
--flashattention
AMD RX 7900 XTX (Vulkan)
./koboldcpp \
--model llama-3.1-8b-Q5_K_M.gguf \
--usevulkan --gpulayers 999 \
--contextsize 16384 \
--flashattention
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Antivirus flags .exe | False positive on PyInstaller | Verify SHA256 from GitHub release |
| "Cuda error 2" at load | Model + KV cache > VRAM | Lower --contextsize or use smaller quant |
| Very slow on first prompt | Prompt prefill | Enable --flashattention |
| Output loops | Sampling too narrow | Add DRY: --dry_multiplier 0.8 |
| Vulkan crashes mid-gen | Driver bug | Update GPU drivers, fall back to CLBlast |
| Image gen fails | sdmodel not loaded | Pass --sdmodel flag explicitly |
| SillyTavern shows wrong template | Tokenizer mismatch | Set --prompttemplate llama3 |
| MultiUser lags | KV cache exhaustion | Lower --contextsize per user |
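For the antivirus row, verifying the download against the checksum published on the GitHub release page takes one command:
# Compare the output with the SHA256 listed on the GitHub release
sha256sum koboldcpp.exe
# On Windows PowerShell the equivalent is: Get-FileHash .\koboldcpp.exe -Algorithm SHA256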
FAQ {#faq}
See answers to common KoboldCpp questions below.
Sources: KoboldCpp GitHub | Kobold Lite UI | SillyTavern docs | Internal benchmarks across NVIDIA, AMD, Apple.