
KoboldCpp Setup Guide (2026): One-Binary Local LLM Server with UI

May 1, 2026
26 min read
LocalAimaster Research Team


KoboldCpp is the all-in-one local AI appliance — a single executable that runs your LLM, ships a usable web UI, speaks the OpenAI API, generates images via Stable Diffusion, transcribes audio with Whisper, synthesizes speech with OuteTTS, exposes embeddings, and supports every modern sampler (including DRY and XTC). No install, no Python, no Docker required. Download one file and run it.

This guide covers everything: installation across Windows / Linux / macOS, GPU acceleration paths (CUDA, Vulkan, Metal, CPU), the Kobold Lite UI, the OpenAI-compatible API, multimodal pipelines, SillyTavern integration, sampler presets for chat / code / roleplay / creative writing, and tuning for common GPUs.

Table of Contents

  1. What KoboldCpp Is
  2. Why It Beats Ollama for Some Workloads
  3. Hardware Requirements
  4. Installation: Windows, Linux, macOS
  5. Your First Run
  6. GPU Acceleration: CUDA, Vulkan, Metal, CPU
  7. Picking GGUF Quantization
  8. Kobold Lite Web UI
  9. OpenAI-Compatible API
  10. Sampler Reference (with DRY and XTC)
  11. Context Shifting and Long Context
  12. Speculative Decoding (Draft Models)
  13. Vision Models (LLaVA, Llama 3.2 Vision, Qwen 2 VL)
  14. Stable Diffusion Image Generation
  15. Whisper Speech-to-Text
  16. Text-to-Speech (OuteTTS)
  17. Embeddings Server
  18. SillyTavern Integration
  19. Multi-User and Concurrency
  20. Tuning Recipes by GPU
  21. Troubleshooting
  22. FAQ


What KoboldCpp Is {#what-it-is}

KoboldCpp (LostRuins/koboldcpp on GitHub) is a fork of llama.cpp packaged as a single executable. It includes:

  • llama.cpp inference engine (same as Ollama under the hood)
  • Kobold Lite — a web chat / story / roleplay UI
  • Three APIs: KoboldAI legacy, OpenAI-compatible, and SSE streaming
  • Stable Diffusion integration (txt2img, img2img, inpainting)
  • Whisper speech-to-text
  • OuteTTS text-to-speech
  • CLIP / vision encoder support for multimodal models
  • Embeddings endpoint compatible with OpenAI /v1/embeddings

All in one ~400 MB binary that needs no Python, no install, no virtualenv.


Why It Beats Ollama for Some Workloads {#vs-ollama}

| Need | Better Choice |
| --- | --- |
| Roleplay / creative writing | KoboldCpp (DRY, XTC, mirostat, smoothing) |
| Multimodal in one binary | KoboldCpp (LLM + SD + Whisper + TTS) |
| Auto model management, simple CLI | Ollama |
| Modelfile customization | Ollama |
| Portable, no install | KoboldCpp |
| Fine-grained sampler control | KoboldCpp |
| Multi-user concurrent serving | Neither — use vLLM |
| Production deployment | Ollama (better systemd / Docker support) |

For a creative-writing single-user stack, KoboldCpp + SillyTavern is the de facto choice in 2026.


Hardware Requirements {#requirements}

| Component | Minimum | Recommended |
| --- | --- | --- |
| OS | Windows 10, Linux (any modern), macOS 12+ | Same |
| RAM | 8 GB | 16-32 GB for 8B-14B models |
| GPU VRAM (NVIDIA / AMD / Intel) | 4 GB (Q4 1B-3B model) | 12 GB+ |
| CPU | 64-bit | 8-core or better for partial offload |
| Disk | 5 GB free | NVMe; models 4-50 GB each |
| GPU API | None (CPU works) | CUDA 11.4+, Vulkan 1.2+, Metal |

KoboldCpp has the broadest hardware compatibility of any local LLM server — even integrated GPUs and 10-year-old CPUs run small models.



Installation: Windows, Linux, macOS {#installation}

Windows

Download koboldcpp.exe from github.com/LostRuins/koboldcpp/releases. That's it — no install. Double-click to launch the GUI loader, or run from CLI.

Linux

# Pre-built binary
wget https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-linux-x64-cuda1210
chmod +x koboldcpp-linux-x64-cuda1210
./koboldcpp-linux-x64-cuda1210 --help

# Or build from source for ROCm / latest features
git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
make LLAMA_HIPBLAS=1   # AMD ROCm
# or:
make LLAMA_CUBLAS=1    # NVIDIA CUDA
# or:
make LLAMA_VULKAN=1    # Vulkan (any GPU)

macOS

# Pre-built ARM binary
wget https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-mac-arm64
chmod +x koboldcpp-mac-arm64
./koboldcpp-mac-arm64 --help

Metal acceleration is built in.

Docker (community-maintained)

docker run --gpus all -p 5001:5001 \
    -v $(pwd)/models:/models \
    koboldai/koboldcpp:latest \
    --model /models/llama-3.1-8b.gguf --port 5001 --host 0.0.0.0

Your First Run {#first-run}

# Download a model
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Launch with sensible defaults
./koboldcpp \
    --model Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
    --usecublas \
    --gpulayers 999 \
    --contextsize 8192 \
    --port 5001

Open http://localhost:5001 in your browser. The Kobold Lite UI loads.

For Windows users who prefer not to use the CLI: launching koboldcpp.exe opens a GUI dialog that asks for the model file, GPU acceleration backend, and context size.


GPU Acceleration: CUDA, Vulkan, Metal, CPU {#acceleration}

| Flag | Backend | Hardware |
| --- | --- | --- |
| --usecublas | CUDA | NVIDIA |
| --usevulkan | Vulkan | NVIDIA, AMD, Intel, Apple (older) |
| --useclblast | CLBlast (deprecated) | Older AMD, Intel |
| --usemetal | Metal | Apple Silicon (auto on Mac) |
| --usecpu | CPU only | Anything |
| --gpulayers N | Layers on GPU (999 = all) | Any GPU backend |
| --tensorsplit 1,1 | Split across multi-GPU | Multi-GPU rigs |

Most users want --usecublas (NVIDIA) or --usevulkan (AMD / Intel / mixed). Metal is auto-detected on Apple Silicon Macs.

For partial GPU offload on a card too small for the full model: set --gpulayers to a number lower than the model's layer count. See CUDA Optimization for layer counts per popular model.


Picking GGUF Quantization {#quantization}

| Quant | Bits | Size (8B) | Quality | When to use |
| --- | --- | --- | --- | --- |
| Q2_K | ~2.5 | 3.2 GB | Bad | Last resort, very tight VRAM |
| Q3_K_M | ~3.5 | 4.0 GB | Mediocre | Tight VRAM, prefer IQ3_XXS instead |
| IQ3_XXS | ~3.0 | 3.5 GB | Better than Q3_K_M | Tight VRAM |
| Q4_K_M | ~4.6 | 4.9 GB | Standard | Default for most |
| IQ4_XS | ~4.3 | 4.6 GB | Slightly better than Q4_K_M | Tight VRAM with quality |
| Q5_K_M | ~5.7 | 5.7 GB | High quality | Plenty of VRAM |
| Q6_K | ~6.6 | 6.6 GB | Near-lossless | Quality-sensitive |
| Q8_0 | 8.5 | 8.5 GB | Essentially lossless | Maximum quality |

For Llama 3.1 70B on a 24 GB RTX 4090, partial offload is the realistic option:

  • Q4_K_M = 42 GB — partial offload ~8 tok/s
  • IQ3_XXS = 30 GB — partial offload ~12 tok/s with better quality than Q3_K_M
  • Q2_K = 25 GB — fits with offload but quality cliff

For 70B fully resident on a single 24 GB GPU, use ExLlamaV2 instead.
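
As a rough pre-download sanity check, a quant's weight file scales with parameter count times bits per weight, and the KV cache adds roughly 2 × layers × KV heads × head dim × context length × bytes per element on top. A back-of-the-envelope sketch in Python; the Llama 3.1 8B architecture numbers are illustrative, so check the model card for other models:

def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    # Approximate GGUF weight size in GB: 1e9 * params_b weights at bits_per_weight each
    return params_b * bits_per_weight / 8

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    # K and V tensors per layer, fp16 by default (use bytes_per_elem=1 for 8-bit KV cache)
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Llama 3.1 8B at Q4_K_M (~4.6 bits/weight) with an 8K context
weights = model_size_gb(8.0, 4.6)
kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
print(f"~{weights:.1f} GB weights + ~{kv:.1f} GB KV cache")   # ~4.6 GB + ~1.1 GB

The real file is slightly larger than the weight estimate because of embeddings and metadata, which is why the table above lists 4.9 GB for Q4_K_M.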


Kobold Lite Web UI {#ui}

The built-in UI at http://localhost:5001 has four modes:

  • Story — long-form continuation, no chat formatting.
  • Adventure — text adventure / interactive fiction.
  • Chat — chat with personas; supports world info / character cards.
  • Instruct — single-turn instruction following.

Sidebar features:

  • AI Settings — every sampler exposed
  • Memory / Author's Note / World Info — persistent context injection
  • Image generation — if Stable Diffusion is loaded
  • Voice — if Whisper / TTS are loaded
  • Saved presets — Balanced, Creative, Precise, plus custom

For roleplay-specific features (character cards, lorebooks, advanced UI), pair with SillyTavern instead — see SillyTavern Integration.


OpenAI-Compatible API {#openai-api}

Endpoints mirror OpenAI plus KoboldAI legacy:

| Endpoint | Purpose |
| --- | --- |
| POST /v1/chat/completions | OpenAI chat |
| POST /v1/completions | OpenAI text completion |
| POST /v1/embeddings | Embeddings |
| POST /sdapi/v1/txt2img | Stable Diffusion |
| POST /api/extra/transcribe | Whisper STT |
| POST /api/extra/tts | Text-to-speech |
| GET /api/v1/model | Model info |
| GET /api/extra/version | Server info |

Example:

curl http://localhost:5001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.1-8b",
        "messages": [{"role":"user","content":"Hello"}],
        "temperature": 0.7,
        "min_p": 0.05,
        "dry_multiplier": 0.8
    }'

Note the non-OpenAI fields (min_p, dry_multiplier) — KoboldCpp accepts the full sampler stack as extension fields.
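
Because the API mirrors OpenAI, the official openai Python SDK works once base_url points at KoboldCpp; the KoboldCpp-specific sampler fields ride along via extra_body. A minimal sketch (the model name is just a label here, since KoboldCpp serves whichever model it loaded):

from openai import OpenAI

# KoboldCpp does not check the API key, but the SDK requires one to be set
client = OpenAI(base_url="http://localhost:5001/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="llama-3.1-8b",   # label only; the loaded GGUF is what actually answers
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
    extra_body={"min_p": 0.05, "dry_multiplier": 0.8},   # KoboldCpp extension fields
)
print(resp.choices[0].message.content)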


Sampler Reference (with DRY and XTC) {#samplers}

KoboldCpp supports every sampler in our LLM Sampling Parameters guide. Key flags / API fields:

{
  "temperature": 0.7,
  "top_k": 0,
  "top_p": 0.9,
  "min_p": 0.05,
  "typical_p": 1.0,
  "smoothing_factor": 0.0,
  "smoothing_curve": 1.0,
  "rep_pen": 1.05,
  "rep_pen_range": 1024,
  "rep_pen_slope": 0.7,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "dynatemp_range": 0.0,
  "dynatemp_exponent": 1.0,
  "mirostat": 0,
  "mirostat_tau": 5.0,
  "mirostat_eta": 0.1,
  "dry_multiplier": 0.8,
  "dry_base": 1.75,
  "dry_allowed_length": 2,
  "dry_penalty_last_n": 0,
  "dry_sequence_breakers": ["\n", ":", "\"", "*"],
  "xtc_threshold": 0.1,
  "xtc_probability": 0.5,
  "grammar": "..."
}

Built-in UI presets:

  • Balanced — temp 0.7, min_p 0.05, rep_pen 1.05
  • Creative — temp 1.1, min_p 0.05, dry 0.8, xtc 0.5/0.1
  • Precise — temp 0.3, min_p 0.05, no XTC, no DRY
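
To reproduce these presets over the API instead of in the UI, keep them as plain dictionaries and merge one into each request; the field names match the sampler JSON above. A small sketch using requests:

import requests

PRESETS = {
    "balanced": {"temperature": 0.7, "min_p": 0.05, "rep_pen": 1.05},
    "creative": {"temperature": 1.1, "min_p": 0.05, "dry_multiplier": 0.8,
                 "xtc_probability": 0.5, "xtc_threshold": 0.1},
    "precise":  {"temperature": 0.3, "min_p": 0.05, "dry_multiplier": 0.0,
                 "xtc_probability": 0.0},
}

payload = {
    "messages": [{"role": "user", "content": "Write an opening line for a heist story."}],
    **PRESETS["creative"],   # merge the chosen preset into the request
}
r = requests.post("http://localhost:5001/v1/chat/completions", json=payload, timeout=120)
print(r.json()["choices"][0]["message"]["content"])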

Context Shifting and Long Context {#context-shift}

KoboldCpp's "Context Shifting" feature reuses KV cache when the conversation grows past max_seq_len — it drops the oldest tokens and shifts remaining KV cache without full re-prefill. This makes long-running chat sessions much faster than vanilla llama.cpp.

./koboldcpp \
    --model model.gguf \
    --contextsize 32768 \
    --noshift              # disable context shifting (rarely needed)

For RoPE-scaled extension on models that need it (older Llama 2 to 16K, etc.):

./koboldcpp \
    --ropeconfig 0.5 32000   # rope_freq_scale rope_freq_base

Speculative Decoding (Draft Models) {#speculative}

./koboldcpp \
    --model llama-3.1-70b.Q4_K_M.gguf \
    --draftmodel llama-3.2-1b.Q4_K_M.gguf \
    --draftgpulayers 999 \
    --draftamount 8

Pair models with the same tokenizer (Llama 3.1 70B + Llama 3.2 1B = same vocab). Expected speedup: 1.5-2.0x at single-user batch size 1. See CUDA Optimization for theory.


Vision Models (LLaVA, Llama 3.2 Vision, Qwen 2 VL) {#vision}

./koboldcpp \
    --model llama-3.2-11b-vision.Q4_K_M.gguf \
    --mmproj llama-3.2-11b-vision-mmproj.gguf \
    --usecublas --gpulayers 999

The --mmproj flag points to the multimodal projector file (CLIP-style image encoder). In Kobold Lite chat mode, drag-and-drop images into the conversation.

Compatible vision models: Llama 3.2 11B / 90B Vision, Qwen 2 VL, LLaVA-1.5 / 1.6, MiniCPM-V, Pixtral.
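
Images can also be sent programmatically through the native KoboldAI-style API, which accepts base64 image data alongside the prompt. A sketch under the assumption that your build exposes /api/v1/generate with an images field (the bundled /api docs show the exact schema); the Alpaca-style prompt below is a placeholder you should swap for your model's template:

import base64
import requests

with open("photo.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

payload = {
    "prompt": "### Instruction:\nDescribe this image.\n\n### Response:\n",
    "images": [img_b64],     # base64 image data, routed through the --mmproj image encoder
    "max_length": 200,
}
r = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=300)
print(r.json()["results"][0]["text"])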


Stable Diffusion Image Generation {#image-gen}

./koboldcpp \
    --model llama-3.1-8b.gguf \
    --sdmodel sdxl-base-1.0.safetensors \
    --sdt5xxl t5xxl_fp16.safetensors    # for SD 3.5 / Flux

In Kobold Lite chat, type /img a cyberpunk samurai and the model generates an image inline.

API endpoint:

curl http://localhost:5001/sdapi/v1/txt2img \
    -H "Content-Type: application/json" \
    -d '{"prompt": "a cat", "steps": 25, "cfg_scale": 7.0}'

Supports SD 1.5, SDXL, SD 3.5, Flux Schnell. Memory budget: SD models share VRAM with the LLM, so plan accordingly. For dedicated image-gen workloads use ComfyUI instead.
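
The /sdapi/v1/txt2img route follows the AUTOMATIC1111-style schema, so the response should contain base64-encoded images in an images array. A minimal sketch that saves the first result under that assumption:

import base64
import requests

payload = {"prompt": "a cyberpunk samurai", "steps": 25, "cfg_scale": 7.0,
           "width": 1024, "height": 1024}
r = requests.post("http://localhost:5001/sdapi/v1/txt2img", json=payload, timeout=600)
images = r.json()["images"]   # list of base64-encoded images, A1111-style
with open("samurai.png", "wb") as f:
    f.write(base64.b64decode(images[0]))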


Whisper Speech-to-Text {#whisper}

./koboldcpp --model llm.gguf --whispermodel ggml-medium.bin

API:

curl http://localhost:5001/api/extra/transcribe \
    -F "file=@audio.wav" -F "model=whisper"

Whisper models: tiny / base / small / medium / large-v3. medium is the sweet spot for English; large-v3 for accuracy or non-English. See Whisper Local Speech-to-Text for the standalone Whisper guide.
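
The same multipart upload works from Python. A small sketch with requests; if the transcription field is named differently on your build, print the full JSON body to see what came back:

import requests

with open("audio.wav", "rb") as f:
    r = requests.post(
        "http://localhost:5001/api/extra/transcribe",
        files={"file": ("audio.wav", f, "audio/wav")},
        data={"model": "whisper"},
        timeout=600,
    )
result = r.json()
print(result.get("text", result))   # fall back to the whole body if "text" is absent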


Text-to-Speech (OuteTTS) {#tts}

./koboldcpp --model llm.gguf --ttsmodel oute-tts-0.3-1b.gguf

OuteTTS produces natural-sounding English speech. Voice cloning is available via --ttsvoice <reference.wav>. Quality is roughly on par with commercial TTS systems for English; non-English is supported but lower quality.


Embeddings Server {#embeddings}

./koboldcpp --model llm.gguf --embeddingsmodel nomic-embed-text-v1.5.Q5_K_M.gguf

OpenAI-compatible /v1/embeddings endpoint. Compatible with Nomic Embed, BGE, GTE, E5, and most modern embedding GGUF models.
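
Because the endpoint is OpenAI-compatible, the openai SDK's embeddings call works unchanged once base_url points at KoboldCpp. A minimal sketch:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5001/v1", api_key="not-needed")

resp = client.embeddings.create(
    model="nomic-embed-text",   # label only; the loaded --embeddingsmodel is what runs
    input=["KoboldCpp serves embeddings locally.", "A second sentence to embed."],
)
vectors = [item.embedding for item in resp.data]
print(len(vectors), len(vectors[0]))   # number of inputs, embedding dimensionality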


SillyTavern Integration {#sillytavern}

SillyTavern is the most-used roleplay / character-chat frontend; KoboldCpp is its preferred backend.

  1. Launch KoboldCpp: ./koboldcpp --model X.gguf --port 5001 --host 0.0.0.0
  2. Install SillyTavern (https://docs.sillytavern.app/installation/)
  3. In SillyTavern: API → "Text Completion" → API type "KoboldCpp" → URL http://localhost:5001
  4. Tick "Streaming"
  5. Save and connect

SillyTavern uses KoboldCpp's full sampler stack and gives you persistent character cards, world info, lorebooks, group chats, and per-character preset overrides.


Multi-User and Concurrency {#multi-user}

KoboldCpp supports parallel processing with --multiuser N:

./koboldcpp --model llm.gguf --multiuser 4 --contextsize 8192

This enables up to 4 concurrent generations sharing the model. Throughput is much lower than vLLM (no PagedAttention), so for >4 concurrent users use vLLM. For 1-4 users on a desktop, KoboldCpp is fine.
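
To see --multiuser behaviour from the client side, fire several requests in parallel; each slot handles one generation while the rest queue. A quick sketch with a thread pool:

from concurrent.futures import ThreadPoolExecutor
import requests

def ask(prompt: str) -> str:
    r = requests.post(
        "http://localhost:5001/v1/chat/completions",
        json={"messages": [{"role": "user", "content": prompt}], "max_tokens": 100},
        timeout=300,
    )
    return r.json()["choices"][0]["message"]["content"]

prompts = [f"Give me one fact about llamas, numbered {i}." for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])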


Tuning Recipes by GPU {#tuning}

RTX 3060 12 GB

./koboldcpp \
    --model llama-3.1-8b-Q5_K_M.gguf \
    --usecublas --gpulayers 999 \
    --contextsize 16384 \
    --flashattention \
    --quantkv 8

RTX 4090 24 GB

./koboldcpp \
    --model llama-3.1-8b-Q5_K_M.gguf \
    --usecublas --gpulayers 999 \
    --contextsize 32768 \
    --flashattention \
    --quantkv 8

For 70B partial offload:

./koboldcpp \
    --model llama-3.1-70b-IQ3_XXS.gguf \
    --usecublas --gpulayers 65 \
    --contextsize 8192 \
    --flashattention \
    --quantkv 4

Apple M4 Max

./koboldcpp \
    --model llama-3.1-70b-Q4_K_M.gguf \
    --usemetal --gpulayers 999 \
    --contextsize 16384 \
    --flashattention

AMD RX 7900 XTX (Vulkan)

./koboldcpp \
    --model llama-3.1-8b-Q5_K_M.gguf \
    --usevulkan --gpulayers 999 \
    --contextsize 16384 \
    --flashattention

Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
| --- | --- | --- |
| Antivirus flags .exe | False positive on PyInstaller | Verify SHA256 from GitHub release |
| "Cuda error 2" at load | Model + KV cache > VRAM | Lower --contextsize or use a smaller quant |
| Very slow on first prompt | Prompt prefill | Enable --flashattention |
| Output loops | Sampling too narrow | Add DRY: --dry_multiplier 0.8 |
| Vulkan crashes mid-gen | Driver bug | Update GPU drivers, fall back to CLBlast |
| Image gen fails | sdmodel not loaded | Pass --sdmodel flag explicitly |
| SillyTavern shows wrong template | Tokenizer mismatch | Set --prompttemplate llama3 |
| Multiuser mode lags | KV cache exhaustion | Lower --contextsize per user |

FAQ {#faq}

See answers to common KoboldCpp questions below.


Sources: KoboldCpp GitHub | Kobold Lite UI | SillyTavern docs | Internal benchmarks across NVIDIA, AMD, Apple.
