KoboldCpp Setup Guide (2026): One-Binary Local LLM Server with UI
KoboldCpp is the all-in-one local AI appliance — a single executable that runs your LLM, ships a usable web UI, speaks the OpenAI API, generates images via Stable Diffusion, transcribes audio with Whisper, synthesizes speech with OuteTTS, exposes embeddings, and supports every modern sampler (including DRY and XTC). No install, no Python, no Docker required. Download one file and run it.
This guide covers everything: installation across Windows / Linux / macOS, GPU acceleration paths (CUDA, Vulkan, Metal, CPU), the Kobold Lite UI, the OpenAI-compatible API, multimodal pipelines, SillyTavern integration, sampler presets for chat / code / roleplay / creative writing, and tuning for common GPUs.
Table of Contents
- What KoboldCpp Is
- Why It Beats Ollama for Some Workloads
- Hardware Requirements
- Installation: Windows, Linux, macOS
- Your First Run
- GPU Acceleration: CUDA, Vulkan, Metal, CPU
- Picking GGUF Quantization
- Kobold Lite Web UI
- OpenAI-Compatible API
- Sampler Reference (with DRY and XTC)
- Context Shifting and Long Context
- Speculative Decoding (Draft Models)
- Vision Models (LLaVA, Llama 3.2 Vision, Qwen 2 VL)
- Stable Diffusion Image Generation
- Whisper Speech-to-Text
- Text-to-Speech (OuteTTS)
- Embeddings Server
- SillyTavern Integration
- Multi-User and Concurrency
- Tuning Recipes by GPU
- Troubleshooting
- FAQ
What KoboldCpp Is {#what-it-is}
KoboldCpp (LostRuins/koboldcpp on GitHub) is a fork of llama.cpp packaged as a single executable. It includes:
- llama.cpp inference engine (same as Ollama under the hood)
- Kobold Lite — a web chat / story / roleplay UI
- Three APIs: KoboldAI legacy, OpenAI-compatible, and SSE streaming
- Stable Diffusion integration (txt2img, img2img, inpainting)
- Whisper speech-to-text
- OuteTTS text-to-speech
- CLIP / vision encoder support for multimodal models
- Embeddings endpoint compatible with OpenAI /v1/embeddings
All in one ~400 MB binary that needs no Python, no install, no virtualenv.
Why It Beats Ollama for Some Workloads {#vs-ollama}
| Need | Better Choice |
|---|---|
| Roleplay / creative writing | KoboldCpp (DRY, XTC, mirostat, smoothing) |
| Multimodal in one binary | KoboldCpp (LLM + SD + Whisper + TTS) |
| Auto model management, simple CLI | Ollama |
| Modelfile customization | Ollama |
| Portable, no install | KoboldCpp |
| Fine-grained sampler control | KoboldCpp |
| Multi-user concurrent serving | Neither — use vLLM |
| Production deployment | Ollama (better systemd / Docker support) |
For a creative-writing single-user stack, KoboldCpp + SillyTavern is the de facto choice in 2026.
Hardware Requirements {#requirements}
| Component | Minimum | Recommended |
|---|---|---|
| OS | Windows 10, Linux (any modern), macOS 12+ | Same |
| RAM | 8 GB | 16-32 GB for 8B-14B models |
| GPU VRAM (NVIDIA / AMD / Intel) | 4 GB (Q4 1B-3B model) | 12 GB+ |
| CPU | 64-bit | 8-core or better for partial offload |
| Disk | 5 GB free | NVMe; models 4-50 GB each |
| GPU API | None (CPU works) | CUDA 11.4+, Vulkan 1.2+, Metal |
KoboldCpp has the broadest hardware compatibility of any local LLM server — even integrated GPUs and 10-year-old CPUs run small models.
Installation: Windows, Linux, macOS {#installation}
Windows
Download koboldcpp.exe from github.com/LostRuins/koboldcpp/releases. That's it — no install. Double-click to launch the GUI loader, or run from CLI.
Linux
# Pre-built binary
wget https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-linux-x64-cuda1210
chmod +x koboldcpp-linux-x64-cuda1210
./koboldcpp-linux-x64-cuda1210 --help
# Or build from source for ROCm / latest features
git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
make LLAMA_HIPBLAS=1 # AMD ROCm
# or:
make LLAMA_CUBLAS=1 # NVIDIA CUDA
# or:
make LLAMA_VULKAN=1 # Vulkan (any GPU)
macOS
# Pre-built ARM binary
wget https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-mac-arm64
chmod +x koboldcpp-mac-arm64
./koboldcpp-mac-arm64 --help
Metal acceleration is built in.
Docker (community-maintained)
docker run --gpus all -p 5001:5001 \
-v $(pwd)/models:/models \
koboldai/koboldcpp:latest \
--model /models/llama-3.1-8b.gguf --port 5001 --host 0.0.0.0
Your First Run {#first-run}
# Download a model
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
# Launch with sensible defaults
./koboldcpp \
--model Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--usecublas \
--gpulayers 999 \
--contextsize 8192 \
--port 5001
Open http://localhost:5001 in your browser. The Kobold Lite UI loads.
Windows users who prefer a GUI can simply double-click koboldcpp.exe; a dialog asks for the model file, the GPU acceleration backend, and the context size.
GPU Acceleration: CUDA, Vulkan, Metal, CPU {#acceleration}
| Flag | Backend | Hardware |
|---|---|---|
| --usecublas | CUDA | NVIDIA |
| --usevulkan | Vulkan | NVIDIA, AMD, Intel, Apple (older) |
| --useclblast | CLBlast (deprecated) | Older AMD, Intel |
| --usemetal | Metal | Apple Silicon (auto on Mac) |
| --usecpu | CPU only | Anything |
| --gpulayers N | Layers on GPU (999 = all) | Any GPU backend |
| --tensorsplit 1,1 | Split across multi-GPU | Multi-GPU rigs |
Most users want --usecublas (NVIDIA) or --usevulkan (AMD / Intel / mixed). Metal is auto-detected on Apple Silicon Macs.
For partial GPU offload on a card too small for the full model: set --gpulayers to a number lower than the model's layer count. See CUDA Optimization for layer counts per popular model.
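For example, a hedged sketch of a partial-offload launch on a small card; the layer and thread counts here are illustrative, not tuned values:
# Partial offload for a GPU too small to hold the whole model.
# KoboldCpp prints the model's total layer count at load time; raise --gpulayers
# until VRAM is nearly full, leaving headroom for the KV cache.
./koboldcpp \
  --model Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --usecublas \
  --gpulayers 16 \
  --contextsize 4096 \
  --threads 8    # CPU threads run the layers that stay off the GPU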
Picking GGUF Quantization {#quantization}
| Quant | Bits | Size (8B) | Quality | When to use |
|---|---|---|---|---|
| Q2_K | ~2.5 | 3.2 GB | Bad | Last resort, very tight VRAM |
| Q3_K_M | ~3.5 | 4.0 GB | Mediocre | Tight VRAM, prefer IQ3_XXS instead |
| IQ3_XXS | ~3.0 | 3.5 GB | Better than Q3_K_M | Tight VRAM |
| Q4_K_M | ~4.6 | 4.9 GB | Standard | Default for most |
| IQ4_XS | ~4.3 | 4.6 GB | Slightly better than Q4_K_M | Tight VRAM with quality |
| Q5_K_M | ~5.7 | 5.7 GB | High quality | Plenty of VRAM |
| Q6_K | ~6.6 | 6.6 GB | Near-lossless | Quality-sensitive |
| Q8_0 | 8.5 | 8.5 GB | Essentially lossless | Maximum quality |
For Llama 3.1 70B (24 GB RTX 4090 partial-offload reality):
- Q4_K_M = 42 GB — partial offload ~8 tok/s
- IQ3_XXS = 30 GB — partial offload ~12 tok/s with better quality than Q3_K_M
- Q2_K = 25 GB — fits with offload but quality cliff
For 70B fully on a single 24GB GPU, use ExLlamaV2 instead.
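To grab a specific quant, download the matching file from the model's GGUF repo. The URL below follows the bartowski naming pattern used earlier and is illustrative; check the repo's file list for the exact filename. As a rough rule, budget the file size plus 1-2 GB for KV cache and overhead.
# Download a specific quant (illustrative URL; confirm the filename in the repo's file list)
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
# Sanity check: file size plus ~1-2 GB of KV/overhead should fit your VRAM budget
ls -lh Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf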
Kobold Lite Web UI {#ui}
The built-in UI at http://localhost:5001 has four modes:
- Story — long-form continuation, no chat formatting.
- Adventure — text adventure / interactive fiction.
- Chat — chat with personas; supports world info / character cards.
- Instruct — single-turn instruction following.
Sidebar features:
- AI Settings — every sampler exposed
- Memory / Author's Note / World Info — persistent context injection
- Image generation — if Stable Diffusion is loaded
- Voice — if Whisper / TTS are loaded
- Saved presets — Balanced, Creative, Precise, plus custom
For roleplay-specific features (character cards, lorebooks, advanced UI), pair with SillyTavern instead — see SillyTavern Integration.
OpenAI-Compatible API {#openai-api}
Endpoints mirror OpenAI plus KoboldAI legacy:
| Endpoint | Purpose |
|---|---|
| POST /v1/chat/completions | OpenAI chat |
| POST /v1/completions | OpenAI text completion |
| POST /v1/embeddings | Embeddings |
| POST /sdapi/v1/txt2img | Stable Diffusion |
| POST /api/extra/transcribe | Whisper STT |
| POST /api/extra/tts | Text-to-speech |
| GET /api/v1/model | Model info |
| GET /api/extra/version | Server info |
Example:
curl http://localhost:5001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b",
"messages": [{"role":"user","content":"Hello"}],
"temperature": 0.7,
"min_p": 0.05,
"dry_multiplier": 0.8
}'
Note the non-OpenAI fields (min_p, dry_multiplier) — KoboldCpp accepts the full sampler stack as extension fields.
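Streaming goes through the same endpoint; a minimal sketch, assuming the standard OpenAI-style stream flag with SSE output (curl's -N disables buffering so chunks print as they arrive):
curl -N http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role":"user","content":"Write a haiku about GPUs"}],
    "stream": true
  }'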
Sampler Reference (with DRY and XTC) {#samplers}
KoboldCpp supports every sampler in our LLM Sampling Parameters guide. Key flags / API fields:
{
"temperature": 0.7,
"top_k": 0,
"top_p": 0.9,
"min_p": 0.05,
"typical_p": 1.0,
"smoothing_factor": 0.0,
"smoothing_curve": 1.0,
"rep_pen": 1.05,
"rep_pen_range": 1024,
"rep_pen_slope": 0.7,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"dynatemp_range": 0.0,
"dynatemp_exponent": 1.0,
"mirostat": 0,
"mirostat_tau": 5.0,
"mirostat_eta": 0.1,
"dry_multiplier": 0.8,
"dry_base": 1.75,
"dry_allowed_length": 2,
"dry_penalty_last_n": 0,
"dry_sequence_breakers": ["\n", ":", "\"", "*"],
"xtc_threshold": 0.1,
"xtc_probability": 0.5,
"grammar": "..."
}
Built-in UI presets:
- Balanced — temp 0.7, min_p 0.05, rep_pen 1.05
- Creative — temp 1.1, min_p 0.05, dry 0.8, xtc 0.5/0.1
- Precise — temp 0.3, min_p 0.05, no XTC, no DRY
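If you drive KoboldCpp from the API instead of the UI, the Creative preset above translates roughly to these request fields; a sketch using the sampler names listed earlier, with values copied from the preset description:
curl http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role":"user","content":"Continue this story: the lighthouse went dark."}],
    "temperature": 1.1,
    "min_p": 0.05,
    "dry_multiplier": 0.8,
    "xtc_probability": 0.5,
    "xtc_threshold": 0.1
  }'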
Context Shifting and Long Context {#context-shift}
KoboldCpp's "Context Shifting" feature reuses the KV cache when a conversation grows past the configured context size: instead of re-prefilling the whole prompt, it drops the oldest tokens and shifts the remaining KV cache forward. This makes long-running chat sessions much faster than vanilla llama.cpp.
./koboldcpp \
--model model.gguf \
--contextsize 32768 \
--noshift # disable context shifting (rarely needed)
For RoPE-scaled extension on models that need it (older Llama 2 to 16K, etc.):
./koboldcpp \
--ropeconfig 0.5 32000 # rope_freq_scale rope_freq_base
Speculative Decoding (Draft Models) {#speculative}
./koboldcpp \
--model llama-3.1-70b.Q4_K_M.gguf \
--draftmodel llama-3.2-1b.Q4_K_M.gguf \
--draftgpulayers 999 \
--draftamount 8
Pair models with the same tokenizer (Llama 3.1 70B + Llama 3.2 1B = same vocab). Expected speedup: 1.5-2.0x at single-user batch size 1. See CUDA Optimization for theory.
Vision Models (LLaVA, Llama 3.2 Vision, Qwen 2 VL) {#vision}
./koboldcpp \
--model llama-3.2-11b-vision.Q4_K_M.gguf \
--mmproj llama-3.2-11b-vision-mmproj.gguf \
--usecublas --gpulayers 999
The --mmproj flag points to the multimodal projector file (CLIP-style image encoder). In Kobold Lite chat mode, drag-and-drop images into the conversation.
Compatible vision models: Llama 3.2 11B / 90B Vision, Qwen 2 VL, LLaVA-1.5 / 1.6, MiniCPM-V, Pixtral.
Stable Diffusion Image Generation {#image-gen}
./koboldcpp \
--model llama-3.1-8b.gguf \
--sdmodel sdxl-base-1.0.safetensors \
--sdt5xxl t5xxl_fp16.safetensors # for SD 3.5 / Flux
In Kobold Lite chat, type /img a cyberpunk samurai and the model generates an image inline.
API endpoint:
curl http://localhost:5001/sdapi/v1/txt2img \
-H "Content-Type: application/json" \
-d '{"prompt": "a cat", "steps": 25, "cfg_scale": 7.0}'
Supports SD 1.5, SDXL, SD 3.5, Flux Schnell. Memory budget: SD models share VRAM with the LLM, so plan accordingly. For dedicated image-gen workloads use ComfyUI instead.
Whisper Speech-to-Text {#whisper}
./koboldcpp --model llm.gguf --whispermodel ggml-medium.bin
API:
curl http://localhost:5001/api/extra/transcribe \
-F "file=@audio.wav" -F "model=whisper"
Whisper models: tiny / base / small / medium / large-v3. medium is the sweet spot for English; large-v3 for accuracy or non-English. See Whisper Local Speech-to-Text for the standalone Whisper guide.
Text-to-Speech (OuteTTS) {#tts}
./koboldcpp --model llm.gguf --ttsmodel oute-tts-0.3-1b.gguf
OuteTTS produces natural English voice. Voice cloning via --ttsvoice <reference.wav>. Quality is roughly equivalent to commercial TTS systems for English; non-English is supported but lower quality.
Embeddings Server {#embeddings}
./koboldcpp --model llm.gguf --embeddingsmodel nomic-embed-text-v1.5.Q5_K_M.gguf
OpenAI-compatible /v1/embeddings endpoint. Compatible with Nomic Embed, BGE, GTE, E5, and most modern embedding GGUF models.
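A quick smoke test of the endpoint; the request body follows the usual OpenAI embeddings shape, and the model name is typically just informational for a local server:
curl http://localhost:5001/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text-v1.5", "input": "KoboldCpp also serves embeddings"}'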
SillyTavern Integration {#sillytavern}
SillyTavern is the most-used roleplay / character-chat frontend; KoboldCpp is its preferred backend.
- Launch KoboldCpp: ./koboldcpp --model X.gguf --port 5001 --host 0.0.0.0
- Install SillyTavern (https://docs.sillytavern.app/installation/)
- In SillyTavern: API → "Text Completion" → API type "KoboldCpp" → URL http://localhost:5001
- Tick "Streaming"
- Save and connect
SillyTavern uses KoboldCpp's full sampler stack and gives you persistent character cards, world info, lorebooks, group chats, and per-character preset overrides.
Multi-User and Concurrency {#multi-user}
KoboldCpp supports parallel processing with --multiuser N:
./koboldcpp --model llm.gguf --multiuser 4 --contextsize 8192
This enables up to 4 concurrent generations sharing the model. Throughput is much lower than vLLM (no PagedAttention), so for >4 concurrent users use vLLM. For 1-4 users on a desktop, KoboldCpp is fine.
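To see the concurrency in action, fire a few requests in parallel; a throwaway sketch with arbitrary prompts and token counts:
# Launch four requests at once against a --multiuser 4 server
for i in 1 2 3 4; do
  curl -s http://localhost:5001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Request '"$i"': say hello", "max_tokens": 32}' &
done
wait    # all four generations run concurrently and return as they finish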
Tuning Recipes by GPU {#tuning}
RTX 3060 12 GB
./koboldcpp \
--model llama-3.1-8b-Q5_K_M.gguf \
--usecublas --gpulayers 999 \
--contextsize 16384 \
--flashattention \
--quantkv 8
RTX 4090 24 GB
./koboldcpp \
--model llama-3.1-8b-Q5_K_M.gguf \
--usecublas --gpulayers 999 \
--contextsize 32768 \
--flashattention \
--quantkv 8
For 70B partial offload:
./koboldcpp \
--model llama-3.1-70b-IQ3_XXS.gguf \
--usecublas --gpulayers 65 \
--contextsize 8192 \
--flashattention \
--quantkv 4
Apple M4 Max
./koboldcpp \
--model llama-3.1-70b-Q4_K_M.gguf \
--usemetal --gpulayers 999 \
--contextsize 16384 \
--flashattention
AMD RX 7900 XTX (Vulkan)
./koboldcpp \
--model llama-3.1-8b-Q5_K_M.gguf \
--usevulkan --gpulayers 999 \
--contextsize 16384 \
--flashattention
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Antivirus flags .exe | False positive on PyInstaller | Verify SHA256 from GitHub release |
| "Cuda error 2" at load | Model + KV cache > VRAM | Lower --contextsize or use smaller quant |
| Very slow on first prompt | Prompt prefill | Enable --flashattention |
| Output loops | Sampling too narrow | Add DRY: --dry_multiplier 0.8 |
| Vulkan crashes mid-gen | Driver bug | Update GPU drivers, fall back to CLBlast |
| Image gen fails | sdmodel not loaded | Pass --sdmodel flag explicitly |
| SillyTavern shows wrong template | Tokenizer mismatch | Set --prompttemplate llama3 |
| MultiUser lags | KV cache exhaustion | Lower --contextsize per user |
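For the antivirus row, verifying the download against the checksum published on the GitHub release page takes one command:
# Compare the output with the SHA256 listed on the GitHub release
sha256sum koboldcpp.exe
# On Windows PowerShell the equivalent is: Get-FileHash .\koboldcpp.exe -Algorithm SHA256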
FAQ {#faq}
See answers to common KoboldCpp questions below.
Sources: KoboldCpp GitHub | Kobold Lite UI | SillyTavern docs | Internal benchmarks across NVIDIA, AMD, Apple.