
LLM Sampling Parameters Explained (2026): Temperature, top-p, min-p, DRY, XTC

May 1, 2026
30 min read
LocalAimaster Research Team


Sampling is the second-most-impactful knob on local LLM output quality, behind only the model itself. Pick the wrong sampler and a 70B model will produce repetitive slop; tune it right and an 8B model can punch above its weight. Yet most users blindly leave temperature at 0.7 and never touch the rest.

This guide explains every modern sampling parameter — what it actually does to the probability distribution, when to use it, and what to set it to. We cover the classics (temperature, top-k, top-p), the modern essentials (min-p, typical-p, mirostat), the 2024-2025 newcomers (DRY, XTC, smoothing factor), and how they compose. At the end you will find ready-to-use presets for chat, code, RAG, JSON, creative writing, and roleplay.

Table of Contents

  1. What Sampling Actually Does
  2. The Sampling Pipeline (Order Matters)
  3. Temperature — The Sharpness Knob
  4. Top-K — The Hard Cutoff
  5. Top-P (Nucleus) — The Cumulative Cutoff
  6. Min-P — The Modern Default
  7. Typical-P — Information-Theoretic Sampling
  8. Smoothing Factor / Quadratic Sampling
  9. Mirostat — Adaptive Sampling
  10. Repetition Penalty
  11. Presence and Frequency Penalty
  12. DRY — Don't Repeat Yourself
  13. XTC — Exclude Top Choices
  14. Beam Search and N-Best
  15. Greedy and Argmax (Temperature 0)
  16. Logit Bias and Banned Tokens
  17. Constrained Generation: JSON, Grammars
  18. Recommended Presets by Workload
  19. How to Set Samplers in Each Framework
  20. Debugging Sampling Issues
  21. FAQ


What Sampling Actually Does {#what-sampling-does}

After every forward pass the model emits a vector of logits — one number per vocab token (Llama 3 vocab = 128,256 tokens; GPT-style ~50,000-100,000). Higher logit means "more likely the next token."

To pick the next token we:

  1. Optionally apply logit modifications (repetition penalty, logit bias, frequency penalty).
  2. Optionally apply truncation (top-k, top-p, min-p, typical-p) — set ignored tokens to -∞.
  3. Apply temperature (divide logits by T).
  4. Convert to probabilities via softmax.
  5. Sample from the resulting distribution.

The sampling stack defines a probability distribution over tokens. Different samplers produce different distributions — and different distributions produce dramatically different outputs.

Key insight: sampling is post-hoc. The model computes the same logits no matter which sampler you use, and the sampler itself is essentially free. So pick the best sampler for your workload, not the cheapest.
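To make the five steps concrete, here is a minimal sketch in plain Python (toy logits, min-p truncation, temperature applied after truncation as modern frameworks do; illustrative only, not any framework's actual implementation):

import math, random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=0.7, min_p=0.05):
    # 1. logit modifications (repetition penalty, logit bias) would go here
    # 2. truncation: min-p computed on the raw (T=1) probabilities
    probs = softmax(logits)
    cutoff = min_p * max(probs)
    keep = [i for i, p in enumerate(probs) if p >= cutoff]
    # 3. temperature on the surviving logits
    scaled = [logits[i] / temperature for i in keep]
    # 4. softmax, then 5. sample a token index from the final distribution
    final = softmax(scaled)
    return random.choices(keep, weights=final, k=1)[0]

print(sample_next([2.0, 1.5, 0.3, -1.0, -4.0]))   # prints a token index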


The Sampling Pipeline (Order Matters) {#pipeline}

Sampler order changes results. Most modern frameworks (llama.cpp, KoboldCpp, oobabooga) let you reorder. The recommended modern order:

raw logits
    ↓
[Repetition penalty]
    ↓
[DRY]
    ↓
[Top-K]                    (often skipped in 2026)
    ↓
[Top-P / Min-P / Typical-P / Smoothing]
    ↓
[Temperature]
    ↓
[XTC]                      (creative only)
    ↓
softmax → sample

Why temperature near the end: applying it before truncation makes top-p/min-p inconsistent across temperatures. Modern frameworks default to temperature last (after truncation) because it gives more predictable behavior.
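One way to see the effect: count how many tokens survive top-p 0.9 when temperature 1.5 is applied before versus after truncation. A toy comparison in plain Python (illustrative numbers only):

import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def top_p_survivors(probs, p=0.9):
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return kept

logits = [3.0, 1.0, 0.5, 0.0, -1.0, -2.0]

# temperature BEFORE truncation: the flattened distribution lets more tokens in
before = top_p_survivors(softmax([x / 1.5 for x in logits]))
# temperature AFTER truncation: top-p sees the raw (T=1) distribution
after = top_p_survivors(softmax(logits))

print(len(before), len(after))   # 4 vs 3: same logits, different survivor sets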


Temperature — The Sharpness Knob {#temperature}

softmax(logits / T)
| T | Effect |
|------|--------|
| 0.0 | Greedy — always pick argmax (deterministic) |
| 0.2 | Very focused, factual, repetitive risk |
| 0.5 | Conservative chat, technical answers |
| 0.7 | Default chat balance |
| 1.0 | Raw distribution, more diverse |
| 1.3 | Creative, occasional weirdness |
| 1.5+ | Heavily creative; needs strong truncation |
| 2.0+ | Chaotic; rarely useful |

Temperature 0 is not the same as "best quality." It picks the single most likely token at every step, but greedy decoding accumulates errors over long outputs. Temperature 0.5-0.7 with min-p 0.05 generally beats greedy on long outputs.

Temperature interacts strongly with truncation samplers — see Min-P below.
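To see the sharpening concretely, push a toy three-token distribution through softmax(logits / T) at a few temperatures (plain Python, illustrative numbers):

import math

def softmax_t(logits, t):
    scaled = [x / t for x in logits]
    m = max(scaled)
    e = [math.exp(x - m) for x in scaled]
    s = sum(e)
    return [round(v / s, 3) for v in e]

logits = [2.0, 1.0, 0.1]
for t in (0.2, 0.7, 1.5):
    print(t, softmax_t(logits, t))
# T=0.2 is nearly one-hot; T=1.5 spreads the mass across all three tokens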



Top-K — The Hard Cutoff {#top-k}

Keep only the K highest-logit tokens; set the rest to -∞.

| K | Effect |
|------|--------|
| 1 | Equivalent to greedy |
| 10 | Very narrow |
| 40 | Classic default |
| 100 | Loose |
| 0 | Disabled (no top-k) |

Top-k's problem: it does not adapt to the distribution. K=40 keeps dozens of junk tokens at a peaky step (where 5 tokens already cover 99% of the mass) and, at a flat step, either cuts perfectly plausible candidates or still lets garbage through.

Modern recommendation: disable top-k (set to 0) and use min-p instead. Top-k is legacy.
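For reference, top-k is only a few lines (plain Python sketch over a list of logits):

def top_k_filter(logits, k):
    # keep the k highest logits, push everything else to -inf
    if k <= 0:
        return logits[:]                       # k = 0 means disabled
    threshold = sorted(logits, reverse=True)[k - 1]
    # ties at the threshold are all kept
    return [x if x >= threshold else float("-inf") for x in logits]

print(top_k_filter([3.0, 1.0, 0.5, -2.0], k=2))   # [3.0, 1.0, -inf, -inf]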


Top-P (Nucleus) — The Cumulative Cutoff {#top-p}

Sort tokens by probability, descending. Keep tokens until their cumulative probability ≥ P. Throw away the rest.

| P | Effect |
|------|--------|
| 0.5 | Tight — only top half |
| 0.9 | Default |
| 0.95 | Looser, allows more variety |
| 1.0 | Disabled (keep everything) |

Top-p was the standard from 2019-2023 and remains the only truncation sampler exposed by OpenAI's API. But it has a subtle problem at non-default temperatures: temperature changes the distribution, but top-p is computed on the post-temperature distribution, so the truncation is not stable across temperature values.

Modern recommendation: keep top-p around 0.9-0.95 if your framework requires a value, but pair it with min-p as the primary truncation sampler.
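A minimal nucleus filter looks like this (plain Python sketch; probabilities in, surviving token indices out):

def top_p_filter(probs, p=0.9):
    # sort token indices by probability, descending
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:        # stop once the nucleus covers >= p of the mass
            break
    return kept

print(top_p_filter([0.55, 0.30, 0.10, 0.04, 0.01]))   # [0, 1, 2]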


Min-P — The Modern Default {#min-p}

Min-P (introduced 2023, popularized 2024) keeps tokens whose probability is at least min_p × p_top, where p_top is the probability of the most likely token.

keep token i if p_i >= min_p * max(p)
| min_p | Effect |
|-------|--------|
| 0.0 | Disabled |
| 0.02 | Very loose |
| 0.05 | Recommended default |
| 0.1 | Tight |
| 0.2 | Very tight, near-greedy |

Why min-p wins: it adapts to model confidence. When the model is sure (peaky distribution, top token at 90%), min-p 0.05 only keeps tokens above 4.5% — automatically tight. When the model is uncertain (flat distribution, top token at 5%), min-p 0.05 keeps anything above 0.25% — automatically loose. Top-p gets this backwards: when the model is confident it often truncates to a single token, cutting reasonable runner-ups, and when the model is uncertain it has to dig deep into the tail to accumulate 0.9 of the mass, letting garbage through.

Composition with temperature: min-p is computed on raw probabilities (after softmax), so it stays meaningful at any temperature. This makes high-temperature creative sampling (T=1.3-1.5) usable.

In 2026, min-p 0.05 + temperature 0.7-1.0 is the most common modern preset.
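The adaptivity is easy to check with a toy example: the same min_p keeps two tokens on a peaky distribution and two dozen on a flat one (plain Python, illustrative numbers):

def min_p_filter(probs, min_p=0.05):
    cutoff = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= cutoff]

peaky = [0.90, 0.06, 0.02, 0.01, 0.01]
flat  = [0.06, 0.05, 0.05, 0.04] + [0.80 / 20] * 20   # 24 middling options

print(len(min_p_filter(peaky)))   # 2  -> cutoff 0.045 keeps only 0.90 and 0.06
print(len(min_p_filter(flat)))    # 24 -> cutoff 0.003 keeps everything here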


Typical-P — Information-Theoretic Sampling {#typical-p}

Typical sampling (2022) keeps tokens whose information content is close to the expected information content of the distribution. Mathematically:

keep token i if |H(p) - log(1/p_i)| is small

where H(p) is entropy.

In practice it behaves similarly to min-p — adapts to distribution shape, avoids both overly-confident and overly-flat outputs. Less commonly used than min-p in 2026 because min-p is simpler and produces similar results.

| typical_p | Effect |
|-----------|--------|
| 0.5 | Tight |
| 0.95 | Default |
| 1.0 | Disabled |
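A sketch of the rule described above (plain Python): compute the entropy, rank tokens by how far their surprisal sits from it, and keep the most typical tokens until their mass reaches typical_p.

import math

def typical_filter(probs, typical_p=0.95):
    # entropy = expected surprisal of the distribution
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # rank tokens by |surprisal - entropy|, most "typical" first
    order = sorted(range(len(probs)),
                   key=lambda i: abs(-math.log(probs[i]) - entropy))
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= typical_p:
            break
    return kept

print(typical_filter([0.50, 0.25, 0.15, 0.07, 0.03]))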

Smoothing Factor / Quadratic Sampling {#smoothing}

A 2024 sampler that flattens or sharpens the distribution non-linearly. Two parameters:

  • smoothing_factor (typical 0.0-3.0) — sharpens distribution as a quadratic curve.
  • smoothing_curve (typical 1.0-2.0) — exponent.

Effect: reduces middle-probability tokens more aggressively than temperature, while preserving the top probabilities. Useful when temperature alone produces bland or repetitive output.

Less common than min-p but supported in oobabooga and KoboldCpp.


Mirostat — Adaptive Sampling {#mirostat}

Mirostat (2020) targets a constant "surprise" level (Shannon information of each generated token) rather than a fixed truncation. It uses a feedback loop:

mu_{t+1} = mu_t - eta * (S_t - tau)

where S_t is observed surprise and tau is target surprise (typical 5.0).

Two variants: Mirostat 1 (per-token feedback) and Mirostat 2 (simpler, more common).

| Parameter | Default | Purpose |
|-----------|---------|---------|
| mirostat | 0 | Variant: 0 = off, 1 = v1, 2 = v2 |
| mirostat_tau | 5.0 | Target surprise — lower = tighter |
| mirostat_eta | 0.1 | Learning rate |

When to use: if you want consistent quality across very long outputs (e.g., 4K+ token generations) and find that min-p drifts. Otherwise, min-p is simpler and matches mirostat in most short/medium outputs.

When mirostat is on, disable top-k, top-p, min-p, typical-p — mirostat replaces them.
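A heavily simplified sketch of the Mirostat 2 loop, built only from the update rule above (illustrative; the real implementation lives in llama.cpp):

import math, random

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def mirostat2_step(logits, mu):
    probs = softmax(logits)
    # keep tokens whose surprise -log2(p) is below the current threshold mu
    keep = [i for i, p in enumerate(probs) if -math.log2(p) <= mu]
    if not keep:                        # always keep at least the top token
        keep = [max(range(len(probs)), key=lambda i: probs[i])]
    token = random.choices(keep, weights=[probs[i] for i in keep], k=1)[0]
    return token, -math.log2(probs[token])

mu, tau, eta = 10.0, 5.0, 0.1           # mu conventionally starts at 2 * tau
logits = [2.0, 1.0, 0.5, -1.0, -3.0]    # fixed here; real logits change per step
for step in range(5):
    token, surprise = mirostat2_step(logits, mu)
    mu -= eta * (surprise - tau)        # feedback update from the formula above
    print(step, token, round(surprise, 2), round(mu, 2))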


Repetition Penalty {#repetition-penalty}

The original repetition control, exposed in llama.cpp since 2023. Divide the logit of any recently-seen token by a multiplier:

logit_i = logit_i / penalty   (if token i in last N tokens)
| penalty | Effect |
|---------|--------|
| 1.0 | Disabled |
| 1.05 | Mild |
| 1.10 | Standard |
| 1.15 | Strong |
| 1.30+ | Causes weirdness — model picks rare synonyms |

Window: typically last 64-256 tokens (repeat_last_n).

Problem: penalizes legitimate repetition (e.g., variable names in code, "the" in prose). DRY is the modern improvement.
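As code, the penalty is a one-line logit edit. A plain-Python sketch; note that common implementations (for example the Hugging Face RepetitionPenaltyLogitsProcessor) divide positive logits but multiply negative ones, so the penalty always lowers the token's probability:

def apply_repetition_penalty(logits, recent_token_ids, penalty=1.10):
    out = logits[:]
    for t in set(recent_token_ids):
        # dividing a negative logit would *raise* its probability,
        # so negative logits are multiplied instead
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

print(apply_repetition_penalty([2.0, -1.0, 0.5], recent_token_ids=[0, 1]))
# -> [1.818..., -1.1, 0.5]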


Presence and Frequency Penalty {#presence-frequency}

OpenAI-style. Operate on logits directly:

logit_i -= presence_penalty                    (if token i appeared at all)
logit_i -= frequency_penalty * count(token_i)  (proportional to count)
| Parameter | Range | Default |
|-----------|-------|---------|
| presence_penalty | -2.0 to 2.0 | 0.0 |
| frequency_penalty | -2.0 to 2.0 | 0.0 |

Use cases: OpenAI-compatible APIs where these are the only repetition controls. For local models with full sampler access, use DRY or repetition_penalty instead — they are more nuanced.
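Both are simple logit edits driven by counts of previously generated tokens (plain Python sketch):

from collections import Counter

def apply_openai_penalties(logits, generated_ids,
                           presence_penalty=0.0, frequency_penalty=0.0):
    counts = Counter(generated_ids)
    out = logits[:]
    for token_id, count in counts.items():
        out[token_id] -= presence_penalty            # flat hit for appearing at all
        out[token_id] -= frequency_penalty * count   # grows with repetition
    return out

print(apply_openai_penalties([2.0, 1.0, 0.0], [0, 0, 0, 1],
                             presence_penalty=0.5, frequency_penalty=0.3))
# roughly [0.6, 0.2, 0.0]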


DRY — Don't Repeat Yourself {#dry}

DRY (introduced 2024 by p-e-w) penalizes multi-token repeats from the prompt or prior output. Unlike repetition_penalty, it scales the penalty exponentially with match length.

penalty = dry_multiplier * dry_base ** (match_length - dry_allowed_length)
| Parameter | Default | Purpose |
|-----------|---------|---------|
| dry_multiplier | 0.8 | Overall strength (0 = off) |
| dry_base | 1.75 | Exponential base |
| dry_allowed_length | 2 | Free repetition under this length |
| dry_sequence_breakers | ["\n", ":", "\"", "*"] | Tokens that reset the matcher |
| dry_penalty_last_n | 0 (all context) | Window |

Why DRY beats repetition_penalty: it specifically targets phrase-level loops ("I am sorry, but I cannot... I am sorry, but I cannot...") without penalizing common short tokens. Code generation works fine because DRY only kicks in past allowed_length tokens of exact match.

Recommended starting values: the defaults (0.8 / 1.75 / 2) work for most chat workloads. Increase dry_multiplier to 1.0+ for stubborn loops.
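To get a feel for the exponential scaling, here is the penalty the formula above produces at those default settings (plain Python):

dry_multiplier, dry_base, dry_allowed_length = 0.8, 1.75, 2

for match_length in (2, 3, 5, 8):
    penalty = dry_multiplier * dry_base ** (match_length - dry_allowed_length)
    print(match_length, round(penalty, 2))
# 2 -> 0.8, 3 -> 1.4, 5 -> 4.29, 8 -> 22.98: short echoes cost little, long loops get crushed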

Supported in llama.cpp (and Ollama via PARAMETER), KoboldCpp, oobabooga, SillyTavern, Aphrodite Engine.


XTC — Exclude Top Choices {#xtc}

XTC (2024) intentionally removes the highest-probability tokens at each step with some probability — pushing the model toward less obvious choices.

if random() < xtc_probability:
    remove all tokens with prob > xtc_threshold (except the lowest such)
| Parameter | Default | Purpose |
|-----------|---------|---------|
| xtc_threshold | 0.1 | Tokens above this are candidates for exclusion |
| xtc_probability | 0.5 | Chance per step that XTC fires |

Effect: outputs are noticeably more creative, less clichéd. Especially good for fiction, roleplay, and brainstorming.

Do not use for: code, JSON, math, factual answers, exact reproductions. XTC will remove the correct token and force a wrong one.

Supported in oobabooga, SillyTavern, KoboldCpp. Not in vLLM as of 2026.
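The pseudocode translates almost directly (plain Python sketch; probabilities in, surviving token indices out):

import random

def xtc_filter(probs, xtc_threshold=0.1, xtc_probability=0.5):
    if random.random() >= xtc_probability:
        return list(range(len(probs)))            # XTC did not fire this step
    above = [i for i, p in enumerate(probs) if p > xtc_threshold]
    if len(above) < 2:
        return list(range(len(probs)))            # need at least two candidates
    # drop every token above the threshold except the *least* likely of them
    keep_lowest = min(above, key=lambda i: probs[i])
    removed = set(above) - {keep_lowest}
    return [i for i in range(len(probs)) if i not in removed]

print(xtc_filter([0.6, 0.25, 0.1, 0.05], xtc_probability=1.0))   # [1, 2, 3]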


Beam Search and N-Best {#beam-search}

Beam search keeps the top-N partial sequences at every step and picks the best at the end. Quality is often higher (more "thoughtful") but throughput drops linearly with beam width.

| Beam width | Quality | Throughput cost |
|------------|---------|-----------------|
| 1 | Greedy | 1x |
| 4 | Solid improvement | 4x |
| 8 | Marginal further improvement | 8x |
| 16+ | Diminishing returns | 16x+ |

When to use: translation, summarization, exact-format generation. Avoid for chat (kills latency) and creative writing (produces bland, "safe" outputs).

vLLM supports beam search (newer releases moved it out of SamplingParams into a dedicated beam-search call, so check your version). llama.cpp removed beam search in 2024 because it was unmaintained.


Greedy and Argmax (Temperature 0) {#greedy}

Greedy decoding always picks the argmax token. Equivalent to temperature 0.

Pros: deterministic, fast, ideal for tests and reproducibility. Cons: accumulates errors over long outputs; produces bland text on creative tasks.

Use for: unit-test fixtures, exact regression baselines, code completion at temperature 0, JSON-mode short outputs.

Avoid for: anything longer than a few sentences of prose, multi-turn chat, or creative work.


Logit Bias and Banned Tokens {#logit-bias}

Manually shift logits for specific tokens.

{ "logit_bias": { "12345": -100, "67890": 5.0 } }

Use cases:

  • Ban specific tokens (e.g., model-specific stop tokens that your framework misses).
  • Bias towards a format (e.g., favor newline tokens for outline outputs).
  • Force JSON characters.

Caveat: values like -100 effectively ban; -5 to +5 are nudges. Tokenize your target string first to find the right token IDs.
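A quick way to find those token IDs is to run your model's tokenizer over the target string. A sketch using the Hugging Face tokenizer; the model name is just an example, swap in whatever you are running:

from transformers import AutoTokenizer

# assumption: the tokenizer for your target model is available locally or via the Hub
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

for word in ["However", " However", "\n"]:
    ids = tok.encode(word, add_special_tokens=False)
    print(repr(word), ids)
# note: " However" (leading space) and "However" usually map to different IDs,
# so bias or ban both variants if you want to suppress the word everywhere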


Constrained Generation: JSON, Grammars {#constrained}

For guaranteed-valid output, constrain the sampler to a grammar.

JSON Schema (vLLM, llama.cpp, Outlines, SGLang)

response_format={
    "type": "json_schema",
    "json_schema": {
        "name": "person",
        "schema": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
            "required": ["name", "age"],
        },
    },
}

vLLM uses xgrammar (default) or outlines. llama.cpp uses GBNF grammars. Throughput overhead: 5-15%.

GBNF (llama.cpp grammar format)

root   ::= object
object ::= "{" pair ("," pair)* "}"
pair   ::= string ":" value
value  ::= string | number | object | "true" | "false" | "null"
string ::= "\"" [a-zA-Z0-9 ]* "\""
number ::= [0-9]+
./llama-cli -m model.gguf --grammar-file grammar.gbnf -p "Generate a person:"

Constrained sampling beats prompting

Don't beg the model in the prompt to "output JSON" — constrain the sampler. The model can't produce invalid output even if it wanted to, because invalid tokens are masked out.


Recommended Presets by Workload {#presets}

Factual chat / Q&A

temperature: 0.5
min_p: 0.05
top_p: 0.9
top_k: 0
repetition_penalty: 1.05
dry_multiplier: 0.6
dry_allowed_length: 2

Code generation (deterministic)

temperature: 0.0          # or 0.2 for slight diversity
top_p: 1.0
min_p: 0.0
repetition_penalty: 1.0   # off — code legitimately repeats
dry_multiplier: 0.0       # off

Code generation (creative — variable names, comments)

temperature: 0.6
min_p: 0.05
top_p: 0.95
repetition_penalty: 1.0
dry_multiplier: 0.4
dry_allowed_length: 4

RAG / grounded answer

temperature: 0.3
min_p: 0.05
top_p: 0.95
repetition_penalty: 1.05
dry_multiplier: 0.4

Creative writing / fiction

temperature: 1.1
min_p: 0.05
top_p: 0.95
repetition_penalty: 1.05
dry_multiplier: 0.8
dry_base: 1.75
dry_allowed_length: 2
xtc_threshold: 0.1
xtc_probability: 0.5

Roleplay / chat-fiction

temperature: 1.0
min_p: 0.07
top_p: 0.95
repetition_penalty: 1.07
dry_multiplier: 1.0
dry_base: 1.75
dry_allowed_length: 2
xtc_threshold: 0.1
xtc_probability: 0.4

JSON mode / structured output

temperature: 0.0   # or 0.3 with constrained generation
top_p: 1.0
min_p: 0.0
constrained: true (json_schema)

Brainstorming / ideation

temperature: 1.3
min_p: 0.03
top_p: 0.98
repetition_penalty: 1.05
dry_multiplier: 0.6
xtc_threshold: 0.1
xtc_probability: 0.6

How to Set Samplers in Each Framework {#frameworks}

Ollama (Modelfile)

FROM llama3.1:8b-instruct-q4_K_M
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER min_p 0.05
PARAMETER top_k 0
PARAMETER repeat_penalty 1.05
PARAMETER repeat_last_n 256
PARAMETER mirostat 0

DRY and XTC are not yet exposed in Ollama as of mid-2026; use llama.cpp directly for those. See Ollama Modelfile Guide.

llama.cpp

./llama-cli -m model.gguf \
    --temp 0.7 \
    --top-p 0.9 \
    --min-p 0.05 \
    --top-k 0 \
    --repeat-penalty 1.05 \
    --repeat-last-n 256 \
    --dry-multiplier 0.8 \
    --dry-base 1.75 \
    --dry-allowed-length 2 \
    --xtc-probability 0.5 \
    --xtc-threshold 0.1

vLLM

from vllm import SamplingParams

sp = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    min_p=0.05,
    top_k=-1,                        # -1 means disabled
    presence_penalty=0.0,
    frequency_penalty=0.0,
    repetition_penalty=1.05,
    max_tokens=2048,
)

DRY and XTC are not in vLLM as of 2026 — file an issue or use Aphrodite Engine for both.

OpenAI-compatible HTTP

{
    "model": "...",
    "messages": [...],
    "temperature": 0.7,
    "top_p": 0.9,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "max_tokens": 2048
}

Min-p, DRY, XTC are not in the OpenAI spec — you must use a non-OpenAI client or extension fields.

KoboldCpp / oobabooga

Both expose every sampler in their UI. KoboldCpp's "preset" dropdown ships with curated presets (Balanced, Creative, Precise) that are good starting points.

SillyTavern

The "Sampler" panel in SillyTavern is the most complete UI for sampling — exposes every sampler from llama.cpp, KoboldCpp, and oobabooga, plus saves presets per character. Best frontend for tuning.


Debugging Sampling Issues {#debugging}

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Output loops forever ("I cannot... I cannot...") | No DRY, low repetition_penalty | DRY 0.8/1.75/2 or rep_penalty 1.10 |
| Output is bland / repetitive across runs | Temperature too low, no min-p | T=0.8, min-p 0.05 |
| Output is incoherent / random | Temperature too high without truncation | Add min-p 0.05 |
| Code has wrong syntax | Temperature > 0 or XTC enabled | T=0, disable XTC |
| JSON not valid | No constrained generation | Use json_schema or GBNF |
| Model picks rare synonyms | repetition_penalty too high | Drop to 1.0-1.05 |
| Output cuts off short | Stop tokens or max_tokens hit | Check stop array, raise max_tokens |
| Different output every run despite T=0 | Non-deterministic kernels | Set seed and torch.use_deterministic_algorithms(True) |

FAQ {#faq}



Sources: llama.cpp sampling source | DRY paper / discussion | XTC discussion | Min-P paper | Mirostat paper (Basu et al., 2020) | Nucleus Sampling paper (Holtzman et al., 2019) | Internal testing on Llama 3.1, Qwen 2.5, Mistral models.
