LLM Sampling Parameters Explained (2026): Temperature, top-p, min-p, DRY, XTC
Sampling is the second-most-impactful knob on local LLM output quality, behind only the model itself. Pick the wrong sampler and a 70B model will produce repetitive slop; tune it right and an 8B model can punch above its weight. Yet most users blindly leave temperature at 0.7 and never touch the rest.
This guide explains every modern sampling parameter — what it actually does to the probability distribution, when to use it, and what to set it to. We cover the classics (temperature, top-k, top-p), the modern essentials (min-p, typical-p, mirostat), the 2024-2025 newcomers (DRY, XTC, smoothing factor), and how they compose. At the end you will find ready-to-use presets for chat, code, RAG, JSON, creative writing, and roleplay.
Table of Contents
- What Sampling Actually Does
- The Sampling Pipeline (Order Matters)
- Temperature — The Sharpness Knob
- Top-K — The Hard Cutoff
- Top-P (Nucleus) — The Cumulative Cutoff
- Min-P — The Modern Default
- Typical-P — Information-Theoretic Sampling
- Smoothing Factor / Quadratic Sampling
- Mirostat — Adaptive Sampling
- Repetition Penalty
- Presence and Frequency Penalty
- DRY — Don't Repeat Yourself
- XTC — Exclude Top Choices
- Beam Search and N-Best
- Greedy and Argmax (Temperature 0)
- Logit Bias and Banned Tokens
- Constrained Generation: JSON, Grammars
- Recommended Presets by Workload
- How to Set Samplers in Each Framework
- Debugging Sampling Issues
What Sampling Actually Does {#what-sampling-does}
After every forward pass the model emits a vector of logits — one number per vocab token (Llama 3 vocab = 128,256 tokens; GPT-style ~50,000-100,000). Higher logit means "more likely the next token."
To pick the next token we:
- Optionally apply logit modifications (repetition penalty, logit bias, frequency penalty).
- Optionally apply truncation (top-k, top-p, min-p, typical-p) — set ignored tokens to -∞.
- Apply temperature (divide logits by T).
- Convert to probabilities via softmax.
- Sample from the resulting distribution.
The sampling stack defines a probability distribution over tokens. Different samplers produce different distributions — and different distributions produce dramatically different outputs.
Key insight: sampling is post-hoc. The model has already computed the logits regardless of which sampler you use, so the sampling step itself is essentially free. Pick the best sampler for your workload, not the cheapest.
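The steps above can be sketched in a few lines of NumPy. This is a minimal illustration (the function name `sample_next_token` is invented, not any framework's API); it uses min-p as the truncation step and applies temperature after truncation, as recommended later in this guide.

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, min_p=0.05, rng=None):
    """One decoding step: truncate with min-p, scale by temperature, sample."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax on raw logits
    keep = probs >= min_p * probs.max()         # min-p truncation
    scaled = np.where(keep, logits / temperature, -np.inf)
    p = np.exp(scaled - scaled.max())
    p /= p.sum()                                # renormalize survivors
    return rng.choice(len(logits), p=p)
```

With a very peaky distribution, min-p leaves only one survivor and the sample is effectively deterministic; with a flat one, many tokens stay in play.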
The Sampling Pipeline (Order Matters) {#pipeline}
Sampler order changes results. Most modern frameworks (llama.cpp, KoboldCpp, oobabooga) let you reorder samplers. The recommended modern order:
raw logits
↓
[Repetition penalty]
↓
[DRY]
↓
[Top-K] (often skipped in 2026)
↓
[Top-P / Min-P / Typical-P / Smoothing]
↓
[Temperature]
↓
[XTC] (creative only)
↓
softmax → sample
Why temperature near the end: applying it before truncation makes top-p/min-p inconsistent across temperatures. Modern frameworks default to temperature last (after truncation) because it gives more predictable behavior.
Temperature — The Sharpness Knob {#temperature}
softmax(logits / T)
| T | Effect |
|---|---|
| 0.0 | Greedy — always pick argmax (deterministic) |
| 0.2 | Very focused, factual, repetitive risk |
| 0.5 | Conservative chat, technical answers |
| 0.7 | Default chat balance |
| 1.0 | Raw distribution, more diverse |
| 1.3 | Creative, occasional weirdness |
| 1.5+ | Heavily creative; needs strong truncation |
| 2.0+ | Chaotic; rarely useful |
Temperature 0 is not the same as "best quality." It picks the single most likely token at every step, but greedy decoding accumulates errors over long outputs. Temperature 0.5-0.7 with min-p 0.05 generally beats greedy on long outputs.
Temperature interacts strongly with truncation samplers — see Min-P below.
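The softmax(logits / T) formula is easy to verify directly. A quick sketch (plain Python, no framework assumed) showing how lower T concentrates probability on the top token and higher T flattens the distribution:

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then softmax. Lower T sharpens, higher T flattens."""
    scaled = [l / T for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
for T in (0.5, 1.0, 2.0):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
```

At T=0.5 the top token takes the overwhelming share; at T=2.0 the three tokens are much closer together.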
Top-K — The Hard Cutoff {#top-k}
Keep only the K highest-logit tokens; set the rest to -∞.
| K | Effect |
|---|---|
| 1 | Equivalent to greedy |
| 10 | Very narrow |
| 40 | Classic default |
| 100 | Loose |
| 0 | Disabled (no top-k) |
Top-k's problem: it does not adapt to the distribution. K=40 might cut too aggressively at a peaky step (where 5 tokens already cover 99%) and not enough at a flat step (where 40 tokens still includes garbage).
Modern recommendation: disable top-k (set to 0) and use min-p instead. Top-k is legacy.
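Top-k is simple enough to write in three lines. A sketch (invented helper name) of the cutoff described above:

```python
def top_k_filter(logits, k):
    """Keep the k highest logits; set the rest to -inf. k=0 disables."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    cutoff = sorted(logits, reverse=True)[k - 1]   # k-th highest logit
    return [l if l >= cutoff else float("-inf") for l in logits]
```

Note that the cutoff is the same whether the distribution is peaky or flat — which is exactly the weakness described above.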
Top-P (Nucleus) — The Cumulative Cutoff {#top-p}
Sort tokens by probability, descending. Keep tokens until their cumulative probability ≥ P. Throw away the rest.
| P | Effect |
|---|---|
| 0.5 | Tight — only top half |
| 0.9 | Default |
| 0.95 | Looser, allows more variety |
| 1.0 | Disabled (keep everything) |
Top-p was the standard from 2019-2023 and is still the default in OpenAI's API. But it has a subtle problem at non-default temperatures: in the classic sampler order, top-p truncates the post-temperature distribution, so the set of surviving tokens shifts whenever you change temperature — the truncation is not stable across temperature values.
Modern recommendation: keep top-p around 0.9-0.95 if your framework requires a value, but pair it with min-p as the primary truncation sampler.
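For clarity, here is a sketch of nucleus truncation operating directly on the softmaxed distribution (the helper name is invented):

```python
def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of top tokens whose cumulative
    probability reaches p; zero out and renormalize the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    out = [probs[i] if i in keep else 0.0 for i in range(len(probs))]
    z = sum(out)
    return [x / z for x in out]
```

With probabilities [0.5, 0.3, 0.15, 0.05] and p=0.8, only the first two tokens survive and are renormalized to [0.625, 0.375].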
Min-P — The Modern Default {#min-p}
Min-P (introduced 2023, popularized 2024) keeps tokens whose probability is at least min_p × p_top, where p_top is the highest-probability token.
keep token i if p_i >= min_p * max(p)
| min_p | Effect |
|---|---|
| 0.0 | Disabled |
| 0.02 | Very loose |
| 0.05 | Recommended default |
| 0.1 | Tight |
| 0.2 | Very tight, near-greedy |
Why min-p wins: it adapts to model confidence. When the model is sure (peaky distribution, top token at 90%), min-p 0.05 only keeps tokens above 4.5% — automatically tight. When the model is uncertain (flat distribution, top token at 5%), min-p 0.05 keeps anything above 0.25% — automatically loose. Top-p does the opposite: tight when confused, loose when confident.
Composition with temperature: min-p is computed on raw probabilities (after softmax), so it stays meaningful at any temperature. This makes high-temperature creative sampling (T=1.3-1.5) usable.
In 2026, min-p 0.05 + temperature 0.7-1.0 is the most common modern preset.
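The adaptive behavior is easy to see in code. A sketch of the rule `p_i >= min_p * max(p)` (helper name invented):

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * max(probs); renormalize."""
    cutoff = min_p * max(probs)
    out = [x if x >= cutoff else 0.0 for x in probs]
    z = sum(out)
    return [x / z for x in out]
```

On a confident distribution like [0.9, 0.06, 0.04], min-p 0.05 keeps just two tokens; on a flat 20-token distribution at 0.05 each, the cutoff drops to 0.0025 and every token survives.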
Typical-P — Information-Theoretic Sampling {#typical-p}
Typical sampling (2022) keeps tokens whose information content is close to the expected information content of the distribution. Mathematically:
keep token i if |H(p) - log(1/p_i)| is small
where H(p) is entropy.
In practice it behaves similarly to min-p — adapts to distribution shape, avoids both overly-confident and overly-flat outputs. Less commonly used than min-p in 2026 because min-p is simpler and produces similar results.
| typical_p | Effect |
|---|---|
| 0.5 | Tight |
| 0.95 | Default |
| 1.0 | Disabled |
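The criterion above can be sketched as: rank tokens by how close their surprisal -log(p_i) is to the entropy H(p), then keep the closest tokens until cumulative probability reaches typical_p. A minimal illustration (helper name invented; real implementations differ in details):

```python
import math

def typical_p_filter(probs, typical_p):
    """Keep tokens whose surprisal is closest to the distribution's entropy,
    up to cumulative probability typical_p; renormalize the survivors."""
    H = -sum(p * math.log(p) for p in probs if p > 0)       # entropy
    order = sorted(
        range(len(probs)),
        key=lambda i: abs(-math.log(probs[i]) - H) if probs[i] > 0 else float("inf"),
    )
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= typical_p:
            break
    out = [probs[i] if i in keep else 0.0 for i in range(len(probs))]
    z = sum(out)
    return [x / z for x in out]
```

Note that, unlike top-p, the first token kept is not necessarily the most probable one — it is the most "typical" one.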
Smoothing Factor / Quadratic Sampling {#smoothing}
A 2024 sampler that flattens or sharpens the distribution non-linearly. Two parameters:
- `smoothing_factor` (typical 0.0-3.0) — sharpens the distribution along a quadratic curve.
- `smoothing_curve` (typical 1.0-2.0) — exponent of the curve.
Effect: reduces middle-probability tokens more aggressively than temperature, while preserving the top probabilities. Useful when temperature alone produces bland or repetitive output.
Less common than min-p but supported in oobabooga and KoboldCpp.
Mirostat — Adaptive Sampling {#mirostat}
Mirostat (2020) targets a constant "surprise" level (Shannon information of each generated token) rather than a fixed truncation. It uses a feedback loop:
mu_{t+1} = mu_t - eta * (S_t - tau)
where S_t is observed surprise and tau is target surprise (typical 5.0).
Two variants: Mirostat v1 (estimates a Zipf exponent for the distribution each step) and Mirostat v2 (a simplified version; more common).
| Parameter | Default | Purpose |
|---|---|---|
| `mirostat` | 0 (off), 1 (v1), 2 (v2) | Variant |
| `mirostat_tau` | 5.0 | Target surprise — lower = tighter |
| `mirostat_eta` | 0.1 | Learning rate |
When to use: if you want consistent quality across very long outputs (e.g., 4K+ token generations) and find that min-p drifts. Otherwise, min-p is simpler and matches mirostat in most short/medium outputs.
When mirostat is on, disable top-k, top-p, min-p, typical-p — mirostat replaces them.
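A sketch of one Mirostat v2 step following the feedback rule above (function name invented; the real algorithm measures surprise in bits, i.e. log base 2, and this simplification omits some bookkeeping):

```python
import math
import random

def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1, rng=random):
    """Drop tokens whose surprisal exceeds mu, sample from the rest,
    then nudge mu toward the target surprise tau."""
    kept = [(i, p) for i, p in enumerate(probs) if -math.log2(p) <= mu] or \
           [max(enumerate(probs), key=lambda t: t[1])]   # always keep argmax
    z = sum(p for _, p in kept)
    r, cum = rng.random() * z, 0.0
    for i, p in kept:                                     # inverse-CDF sample
        cum += p
        if cum >= r:
            break
    surprise = -math.log2(probs[i])
    mu -= eta * (surprise - tau)                          # feedback update
    return i, mu
```

If generated tokens are less surprising than tau, mu rises and the truncation loosens; if they are more surprising, mu falls and it tightens.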
Repetition Penalty {#repetition-penalty}
The original repetition control (llama.cpp, 2023; the idea dates back to the CTRL paper). For any token seen in the last N tokens, divide its logit by the penalty if the logit is positive, or multiply if it is negative — both push the probability down:
logit_i = logit_i / penalty if logit_i > 0 else logit_i * penalty (if token i in last N tokens)
| penalty | Effect |
|---|---|
| 1.0 | Disabled |
| 1.05 | Mild |
| 1.10 | Standard |
| 1.15 | Strong |
| 1.30+ | Causes weirdness — model picks rare synonyms |
Window: typically last 64-256 tokens (repeat_last_n).
Problem: penalizes legitimate repetition (e.g., variable names in code, "the" in prose). DRY is the modern improvement.
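A sketch of the rule (helper name invented), showing why the divide/multiply split matters: dividing a negative logit would raise its probability, so negative logits are multiplied instead.

```python
def apply_repetition_penalty(logits, recent_tokens, penalty):
    """CTRL-style repetition penalty: push down the logit of every
    recently seen token, whatever its sign."""
    out = list(logits)
    for t in set(recent_tokens):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out
```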
Presence and Frequency Penalty {#presence-frequency}
OpenAI-style. Operate on logits directly:
logit_i -= presence_penalty (if token i appeared at all)
logit_i -= frequency_penalty * count(token_i) (proportional to count)
| Parameter | Range | Default |
|---|---|---|
| presence_penalty | -2.0 to 2.0 | 0.0 |
| frequency_penalty | -2.0 to 2.0 | 0.0 |
Use cases: OpenAI-compatible APIs where these are the only repetition controls. For local models with full sampler access, use DRY or repetition_penalty instead — they are more nuanced.
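The two formulas combine into a single per-token subtraction. A sketch (helper name invented):

```python
from collections import Counter

def apply_openai_penalties(logits, generated, presence_penalty, frequency_penalty):
    """OpenAI-style penalties: a flat presence hit plus a per-occurrence
    frequency hit for every token already generated."""
    counts = Counter(generated)
    out = list(logits)
    for t, n in counts.items():
        out[t] -= presence_penalty + frequency_penalty * n
    return out
```

A token generated twice with presence 0.5 and frequency 0.2 loses 0.5 + 0.2 × 2 = 0.9 from its logit.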
DRY — Don't Repeat Yourself {#dry}
DRY (introduced 2024 by p-e-w) penalizes multi-token repeats from the prompt or prior output. Unlike repetition_penalty, it scales the penalty exponentially with match length.
penalty = dry_multiplier * dry_base ** (match_length - dry_allowed_length)
| Parameter | Default | Purpose |
|---|---|---|
| `dry_multiplier` | 0.8 | Overall strength (0 = off) |
| `dry_base` | 1.75 | Exponential base |
| `dry_allowed_length` | 2 | Free repetition under this length |
| `dry_sequence_breakers` | ["\n", ":", "\"", "*"] | Tokens that reset the matcher |
| `dry_penalty_last_n` | 0 (all context) | Window |
Why DRY beats repetition_penalty: it specifically targets phrase-level loops ("I am sorry, but I cannot... I am sorry, but I cannot...") without penalizing common short tokens. Code generation works fine because DRY only kicks in past allowed_length tokens of exact match.
Recommended starting values: the defaults (0.8 / 1.75 / 2) work for most chat workloads. Increase dry_multiplier to 1.0+ for stubborn loops.
Supported in llama.cpp (and Ollama via PARAMETER), KoboldCpp, oobabooga, SillyTavern, Aphrodite Engine.
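The core of DRY is a suffix match: if emitting a candidate token would extend a repeat of an earlier phrase, the penalty grows exponentially with the repeat's length. A simplified sketch (function name invented; the real implementation also handles sequence breakers and uses an efficient matcher):

```python
def dry_penalty(context, candidate, multiplier=0.8, base=1.75, allowed_length=2):
    """Return the penalty to subtract from `candidate`'s logit if emitting it
    would extend a repeat already present in `context` (0.0 otherwise)."""
    best = 0
    for j in range(len(context) - 1, -1, -1):
        if context[j] != candidate:
            continue                              # earlier occurrence of candidate
        # length of the match between the text before position j
        # and the current end of context
        n = 0
        while (n < j and n < len(context) - 1
               and context[j - 1 - n] == context[len(context) - 1 - n]):
            n += 1
        best = max(best, n)
    if best < allowed_length:
        return 0.0                                # short repeats are free
    return multiplier * base ** (best - allowed_length)
```

With context [1, 2, 3, 1, 2] and candidate 3, emitting 3 would repeat the phrase "1 2 3", a match of length 2, so the penalty is 0.8 × 1.75⁰ = 0.8; longer loops are punished exponentially harder.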
XTC — Exclude Top Choices {#xtc}
XTC (2024) intentionally removes the highest-probability tokens at each step with some probability — pushing the model toward less obvious choices.
if random() < xtc_probability:
remove all tokens with prob > xtc_threshold (except the lowest such)
| Parameter | Default | Purpose |
|---|---|---|
| `xtc_threshold` | 0.1 | Tokens above this are candidates for exclusion |
| `xtc_probability` | 0.5 | Chance per step that XTC fires |
Effect: outputs are noticeably more creative, less clichéd. Especially good for fiction, roleplay, and brainstorming.
Do not use for: code, JSON, math, factual answers, exact reproductions. XTC will remove the correct token and force a wrong one.
Supported in oobabooga, SillyTavern, KoboldCpp. Not in vLLM as of 2026.
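The pseudocode above can be fleshed out as follows (helper name invented; the rng parameter exists only to make the sketch testable):

```python
import random

def xtc_filter(probs, threshold=0.1, probability=0.5, rng=random):
    """XTC sketch: with some per-step probability, remove every token above
    `threshold` except the least likely of them, then renormalize."""
    if rng.random() >= probability:
        return list(probs)                  # XTC does not fire this step
    above = [i for i, p in enumerate(probs) if p > threshold]
    if len(above) < 2:
        return list(probs)                  # nothing safe to exclude
    keep_lowest = min(above, key=lambda i: probs[i])
    out = [0.0 if (i in above and i != keep_lowest) else p
           for i, p in enumerate(probs)]
    z = sum(out)
    return [x / z for x in out]
```

Keeping the least likely above-threshold token guarantees at least one "viable" choice survives — the model is pushed toward its second-tier candidates, not into garbage.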
Beam Search and N-Best {#beam-search}
Beam search keeps the top-N partial sequences at every step and picks the best at the end. Quality is often higher (more "thoughtful") but throughput drops linearly with beam width.
| Beam width | Quality | Throughput cost |
|---|---|---|
| 1 | Greedy | 1x |
| 4 | Solid improvement | 4x |
| 8 | Marginal further improvement | 8x |
| 16+ | Diminishing returns | 16x+ |
When to use: translation, summarization, exact-format generation. Avoid for chat (kills latency) and creative writing (produces bland, "safe" outputs).
vLLM supports beam search (older versions via use_beam_search=True on SamplingParams; recent versions moved it to a dedicated beam-search API). llama.cpp removed beam search in 2024 because it was unmaintained.
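The bookkeeping behind beam search is straightforward. A toy sketch (function name invented) over a fixed table of per-step log-probabilities — real decoding is context-dependent, so each beam would get its own forward pass:

```python
import math

def beam_search(step_logprobs, width):
    """Keep the `width` best partial sequences at every step;
    return the best (sequence, score) pair at the end."""
    beams = [((), 0.0)]                       # (token sequence, total log-prob)
    for logprobs in step_logprobs:
        candidates = [(seq + (tok,), score + lp)
                      for seq, score in beams
                      for tok, lp in enumerate(logprobs)]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0]
```

The linear throughput cost in the table above falls out directly: every step expands and scores `width` times as many candidates.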
Greedy and Argmax (Temperature 0) {#greedy}
Greedy decoding always picks the argmax token. Equivalent to temperature 0.
Pros: deterministic, fast, ideal for tests and reproducibility. Cons: accumulates errors over long outputs; produces bland text on creative tasks.
Use for: unit-test fixtures, exact regression baselines, code completion at temperature 0, JSON-mode short outputs.
Avoid for: anything longer than a few sentences of prose, multi-turn chat, or creative work.
Logit Bias and Banned Tokens {#logit-bias}
Manually shift logits for specific tokens.
{ "logit_bias": { "12345": -100, "67890": 5.0 } }
Use cases:
- Ban specific tokens (e.g., model-specific stop tokens that your framework misses).
- Bias towards a format (e.g., favor newline tokens for outline outputs).
- Force JSON characters.
Caveat: values like -100 effectively ban; -5 to +5 are nudges. Tokenize your target string first to find the right token IDs.
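Applying a bias map is a one-liner per token. A sketch (helper name invented) of what the API does to the logit vector:

```python
def apply_logit_bias(logits, bias):
    """Add per-token-ID biases to the logits; -100 effectively bans a token."""
    out = list(logits)
    for token_id, b in bias.items():
        out[token_id] += b
    return out
```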
Constrained Generation: JSON, Grammars {#constrained}
For guaranteed-valid output, constrain the sampler to a grammar.
JSON Schema (vLLM, llama.cpp, Outlines, SGLang)
response_format={
"type": "json_schema",
"json_schema": {
"name": "person",
"schema": {
"type": "object",
"properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
"required": ["name", "age"],
},
},
}
vLLM uses xgrammar (default) or outlines. llama.cpp uses GBNF grammars. Throughput overhead: 5-15%.
GBNF (llama.cpp grammar format)
root ::= object
object ::= "{" pair ("," pair)* "}"
pair ::= string ":" value
value ::= string | number | object | "true" | "false" | "null"
string ::= "\"" [a-zA-Z0-9 ]* "\""
number ::= [0-9]+
./llama-cli -m model.gguf --grammar-file grammar.gbnf -p "Generate a person:"
Constrained sampling beats prompting
Don't beg the model in the prompt to "output JSON" — constrain the sampler. The model can't produce invalid output even if it wanted to, because invalid tokens are masked out.
Recommended Presets by Workload {#presets}
Factual chat / Q&A
temperature: 0.5
min_p: 0.05
top_p: 0.9
top_k: 0
repetition_penalty: 1.05
dry_multiplier: 0.6
dry_allowed_length: 2
Code generation (deterministic)
temperature: 0.0 # or 0.2 for slight diversity
top_p: 1.0
min_p: 0.0
repetition_penalty: 1.0 # off — code legitimately repeats
dry_multiplier: 0.0 # off
Code generation (creative — variable names, comments)
temperature: 0.6
min_p: 0.05
top_p: 0.95
repetition_penalty: 1.0
dry_multiplier: 0.4
dry_allowed_length: 4
RAG / grounded answer
temperature: 0.3
min_p: 0.05
top_p: 0.95
repetition_penalty: 1.05
dry_multiplier: 0.4
Creative writing / fiction
temperature: 1.1
min_p: 0.05
top_p: 0.95
repetition_penalty: 1.05
dry_multiplier: 0.8
dry_base: 1.75
dry_allowed_length: 2
xtc_threshold: 0.1
xtc_probability: 0.5
Roleplay / chat-fiction
temperature: 1.0
min_p: 0.07
top_p: 0.95
repetition_penalty: 1.07
dry_multiplier: 1.0
dry_base: 1.75
dry_allowed_length: 2
xtc_threshold: 0.1
xtc_probability: 0.4
JSON mode / structured output
temperature: 0.0 # or 0.3 with constrained generation
top_p: 1.0
min_p: 0.0
constrained: true (json_schema)
Brainstorming / ideation
temperature: 1.3
min_p: 0.03
top_p: 0.98
repetition_penalty: 1.05
dry_multiplier: 0.6
xtc_threshold: 0.1
xtc_probability: 0.6
How to Set Samplers in Each Framework {#frameworks}
Ollama (Modelfile)
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER min_p 0.05
PARAMETER top_k 0
PARAMETER repeat_penalty 1.05
PARAMETER repeat_last_n 256
PARAMETER mirostat 0
DRY and XTC are not yet exposed in Ollama as of mid-2026; use llama.cpp directly for those. See Ollama Modelfile Guide.
llama.cpp
./llama-cli -m model.gguf \
--temp 0.7 \
--top-p 0.9 \
--min-p 0.05 \
--top-k 0 \
--repeat-penalty 1.05 \
--repeat-last-n 256 \
--dry-multiplier 0.8 \
--dry-base 1.75 \
--dry-allowed-length 2 \
--xtc-probability 0.5 \
--xtc-threshold 0.1
vLLM
from vllm import SamplingParams
sp = SamplingParams(
temperature=0.7,
top_p=0.9,
min_p=0.05,
top_k=-1, # -1 means disabled
presence_penalty=0.0,
frequency_penalty=0.0,
repetition_penalty=1.05,
max_tokens=2048,
)
DRY and XTC are not in vLLM as of 2026 — file an issue or use Aphrodite Engine for both.
OpenAI-compatible HTTP
{
"model": "...",
"messages": [...],
"temperature": 0.7,
"top_p": 0.9,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"max_tokens": 2048
}
Min-p, DRY, and XTC are not part of the OpenAI spec — you need a non-OpenAI client or vendor extension fields to set them.
KoboldCpp / oobabooga
Both expose every sampler in their UI. KoboldCpp's "preset" dropdown ships with curated presets (Balanced, Creative, Precise) that are good starting points.
SillyTavern
The "Sampler" panel in SillyTavern is the most complete UI for sampling — exposes every sampler from llama.cpp, KoboldCpp, and oobabooga, plus saves presets per character. Best frontend for tuning.
Debugging Sampling Issues {#debugging}
| Symptom | Likely Cause | Fix |
|---|---|---|
| Output loops forever ("I cannot... I cannot...") | No DRY, low repetition_penalty | DRY 0.8/1.75/2 or rep_penalty 1.10 |
| Output is bland / repetitive across runs | Temperature too low, no min-p | T=0.8, min-p 0.05 |
| Output is incoherent / random | Temperature too high without truncation | Add min-p 0.05 |
| Code has wrong syntax | Temperature > 0 or XTC enabled | T=0, disable XTC |
| JSON not valid | No constrained generation | Use json_schema or GBNF |
| Model picks rare synonyms | repetition_penalty too high | Drop to 1.0-1.05 |
| Output cuts off short | Stop tokens or max_tokens hit | Check stop array, raise max_tokens |
| Different output every run despite T=0 | Non-deterministic kernels | Set seed and torch.use_deterministic_algorithms(True) |
Sources: llama.cpp sampling source | DRY paper / discussion | XTC discussion | Min-P paper | Mirostat paper (Basu et al., 2020) | Nucleus Sampling paper (Holtzman et al., 2019) | Internal testing on Llama 3.1, Qwen 2.5, Mistral models.