LLM Sampling Parameters Explained (2026): Temperature, top-p, min-p, DRY, XTC
Sampling is the second-most-impactful knob on local LLM output quality, behind only the model itself. Pick the wrong sampler and a 70B model will produce repetitive slop; tune it right and an 8B model can punch above its weight. Yet most users blindly leave temperature at 0.7 and never touch the rest.
This guide explains every modern sampling parameter — what it actually does to the probability distribution, when to use it, and what to set it to. We cover the classics (temperature, top-k, top-p), the modern essentials (min-p, typical-p, mirostat), the 2024-2025 newcomers (DRY, XTC, smoothing factor), and how they compose. At the end you will find ready-to-use presets for chat, code, RAG, JSON, creative writing, and roleplay.
Table of Contents
- What Sampling Actually Does
- The Sampling Pipeline (Order Matters)
- Temperature — The Sharpness Knob
- Top-K — The Hard Cutoff
- Top-P (Nucleus) — The Cumulative Cutoff
- Min-P — The Modern Default
- Typical-P — Information-Theoretic Sampling
- Smoothing Factor / Quadratic Sampling
- Mirostat — Adaptive Sampling
- Repetition Penalty
- Presence and Frequency Penalty
- DRY — Don't Repeat Yourself
- XTC — Exclude Top Choices
- Beam Search and N-Best
- Greedy and Argmax (Temperature 0)
- Logit Bias and Banned Tokens
- Constrained Generation: JSON, Grammars
- Recommended Presets by Workload
- How to Set Samplers in Each Framework
- Debugging Sampling Issues
What Sampling Actually Does {#what-sampling-does}
After every forward pass the model emits a vector of logits — one number per vocab token (Llama 3 vocab = 128,256 tokens; GPT-style ~50,000-100,000). Higher logit means "more likely the next token."
To pick the next token we:
- Optionally apply logit modifications (repetition penalty, logit bias, frequency penalty).
- Optionally apply truncation (top-k, top-p, min-p, typical-p) — set ignored tokens to -∞.
- Apply temperature (divide logits by T).
- Convert to probabilities via softmax.
- Sample from the resulting distribution.
The sampling stack defines a probability distribution over tokens. Different samplers produce different distributions — and different distributions produce dramatically different outputs.
Key insight: sampling is post-hoc. The model has already computed the logits regardless of which sampler you use, so the sampling step itself is essentially free. Pick the best sampler for your workload, not the cheapest.
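The steps above can be sketched in a few lines of NumPy. This is a minimal illustration (the function name `sample_next_token` is invented, not any framework's API); it uses min-p as the truncation step and applies temperature after truncation, as recommended later in this guide.

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, min_p=0.05, rng=None):
    """One decoding step: truncate with min-p, scale by temperature, sample."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax on raw logits
    keep = probs >= min_p * probs.max()         # min-p truncation
    scaled = np.where(keep, logits / temperature, -np.inf)
    p = np.exp(scaled - scaled.max())
    p /= p.sum()                                # renormalize survivors
    return rng.choice(len(logits), p=p)
```

With a very peaky distribution, min-p leaves only one survivor and the sample is effectively deterministic; with a flat one, many tokens stay in play.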
The Sampling Pipeline (Order Matters) {#pipeline}
Sampler order changes results. Most modern frameworks (llama.cpp, KoboldCpp, oobabooga) let you reorder samplers. The recommended modern order:
raw logits
↓
[Repetition penalty]
↓
[DRY]
↓
[Top-K] (often skipped in 2026)
↓
[Top-P / Min-P / Typical-P / Smoothing]
↓
[Temperature]
↓
[XTC] (creative only)
↓
softmax → sample
Why temperature near the end: applying it before truncation makes top-p/min-p inconsistent across temperatures. Modern frameworks default to temperature last (after truncation) because it gives more predictable behavior.
Temperature — The Sharpness Knob {#temperature}
softmax(logits / T)
| T | Effect |
|---|---|
| 0.0 | Greedy — always pick argmax (deterministic) |
| 0.2 | Very focused, factual, repetitive risk |
| 0.5 | Conservative chat, technical answers |
| 0.7 | Default chat balance |
| 1.0 | Raw distribution, more diverse |
| 1.3 | Creative, occasional weirdness |
| 1.5+ | Heavily creative; needs strong truncation |
| 2.0+ | Chaotic; rarely useful |
Temperature 0 is not the same as "best quality." It picks the single most likely token at every step, but greedy decoding accumulates errors over long outputs. Temperature 0.5-0.7 with min-p 0.05 generally beats greedy on long outputs.
Temperature interacts strongly with truncation samplers — see Min-P below.
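The softmax(logits / T) formula is easy to verify directly. A quick sketch (plain Python, no framework assumed) showing how lower T concentrates probability on the top token and higher T flattens the distribution:

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then softmax. Lower T sharpens, higher T flattens."""
    scaled = [l / T for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
for T in (0.5, 1.0, 2.0):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
```

At T=0.5 the top token takes the overwhelming share; at T=2.0 the three tokens are much closer together.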
Top-K — The Hard Cutoff {#top-k}
Keep only the K highest-logit tokens; set the rest to -∞.
| K | Effect |
|---|---|
| 1 | Equivalent to greedy |
| 10 | Very narrow |
| 40 | Classic default |
| 100 | Loose |
| 0 | Disabled (no top-k) |
Top-k's problem: it does not adapt to the distribution. K=40 might cut too aggressively at a peaky step (where 5 tokens already cover 99%) and not enough at a flat step (where 40 tokens still includes garbage).
Modern recommendation: disable top-k (set to 0) and use min-p instead. Top-k is legacy.
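Top-k is simple enough to write in three lines. A sketch (invented helper name) of the cutoff described above:

```python
def top_k_filter(logits, k):
    """Keep the k highest logits; set the rest to -inf. k=0 disables."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    cutoff = sorted(logits, reverse=True)[k - 1]   # k-th highest logit
    return [l if l >= cutoff else float("-inf") for l in logits]
```

Note that the cutoff is the same whether the distribution is peaky or flat — which is exactly the weakness described above.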
Top-P (Nucleus) — The Cumulative Cutoff {#top-p}
Sort tokens by probability, descending. Keep tokens until their cumulative probability ≥ P. Throw away the rest.
| P | Effect |
|---|---|
| 0.5 | Tight — only top half |
| 0.9 | Default |
| 0.95 | Looser, allows more variety |
| 1.0 | Disabled (keep everything) |
Top-p was the standard from 2019-2023 and is still the default in OpenAI's API. But it has a subtle problem at non-default temperatures: in the classic sampler order, top-p truncates the post-temperature distribution, so the set of surviving tokens shifts whenever you change temperature — the truncation is not stable across temperature values.
Modern recommendation: keep top-p around 0.9-0.95 if your framework requires a value, but pair it with min-p as the primary truncation sampler.
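For clarity, here is a sketch of nucleus truncation operating directly on the softmaxed distribution (the helper name is invented):

```python
def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of top tokens whose cumulative
    probability reaches p; zero out and renormalize the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    out = [probs[i] if i in keep else 0.0 for i in range(len(probs))]
    z = sum(out)
    return [x / z for x in out]
```

With probabilities [0.5, 0.3, 0.15, 0.05] and p=0.8, only the first two tokens survive and are renormalized to [0.625, 0.375].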
Min-P — The Modern Default {#min-p}
Min-P (introduced 2023, popularized 2024) keeps tokens whose probability is at least min_p × p_top, where p_top is the highest-probability token.
keep token i if p_i >= min_p * max(p)
| min_p | Effect |
|---|---|
| 0.0 | Disabled |
| 0.02 | Very loose |
| 0.05 | Recommended default |
| 0.1 | Tight |
| 0.2 | Very tight, near-greedy |
Why min-p wins: it adapts to model confidence. When the model is sure (peaky distribution, top token at 90%), min-p 0.05 only keeps tokens above 4.5% — automatically tight. When the model is uncertain (flat distribution, top token at 5%), min-p 0.05 keeps anything above 0.25% — automatically loose. Top-p does the opposite: tight when confused, loose when confident.
Composition with temperature: min-p is computed on raw probabilities (after softmax), so it stays meaningful at any temperature. This makes high-temperature creative sampling (T=1.3-1.5) usable.
In 2026, min-p 0.05 + temperature 0.7-1.0 is the most common modern preset.
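The adaptive behavior is easy to see in code. A sketch of the rule `p_i >= min_p * max(p)` (helper name invented):

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * max(probs); renormalize."""
    cutoff = min_p * max(probs)
    out = [x if x >= cutoff else 0.0 for x in probs]
    z = sum(out)
    return [x / z for x in out]
```

On a confident distribution like [0.9, 0.06, 0.04], min-p 0.05 keeps just two tokens; on a flat 20-token distribution at 0.05 each, the cutoff drops to 0.0025 and every token survives.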
Typical-P — Information-Theoretic Sampling {#typical-p}
Typical sampling (2022) keeps tokens whose information content is close to the expected information content of the distribution. Mathematically:
keep token i if |H(p) - log(1/p_i)| is small
where H(p) is entropy.
In practice it behaves similarly to min-p — adapts to distribution shape, avoids both overly-confident and overly-flat outputs. Less commonly used than min-p in 2026 because min-p is simpler and produces similar results.
| typical_p | Effect |
|---|---|
| 0.5 | Tight |
| 0.95 | Default |
| 1.0 | Disabled |
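The criterion above can be sketched as: rank tokens by how close their surprisal -log(p_i) is to the entropy H(p), then keep the closest tokens until cumulative probability reaches typical_p. A minimal illustration (helper name invented; real implementations differ in details):

```python
import math

def typical_p_filter(probs, typical_p):
    """Keep tokens whose surprisal is closest to the distribution's entropy,
    up to cumulative probability typical_p; renormalize the survivors."""
    H = -sum(p * math.log(p) for p in probs if p > 0)       # entropy
    order = sorted(
        range(len(probs)),
        key=lambda i: abs(-math.log(probs[i]) - H) if probs[i] > 0 else float("inf"),
    )
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= typical_p:
            break
    out = [probs[i] if i in keep else 0.0 for i in range(len(probs))]
    z = sum(out)
    return [x / z for x in out]
```

Note that, unlike top-p, the first token kept is not necessarily the most probable one — it is the most "typical" one.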
Smoothing Factor / Quadratic Sampling {#smoothing}
A 2024 sampler that flattens or sharpens the distribution non-linearly. Two parameters:
- `smoothing_factor` (typical 0.0-3.0) — sharpens the distribution along a quadratic curve.
- `smoothing_curve` (typical 1.0-2.0) — exponent of the curve.
Effect: reduces middle-probability tokens more aggressively than temperature, while preserving the top probabilities. Useful when temperature alone produces bland or repetitive output.
Less common than min-p but supported in oobabooga and KoboldCpp.
Mirostat — Adaptive Sampling {#mirostat}
Mirostat (2020) targets a constant "surprise" level (Shannon information of each generated token) rather than a fixed truncation. It uses a feedback loop:
mu_{t+1} = mu_t - eta * (S_t - tau)
where S_t is observed surprise and tau is target surprise (typical 5.0).
Two variants: Mirostat v1 (estimates a Zipf exponent for the distribution each step) and Mirostat v2 (a simplified version; more common).
| Parameter | Default | Purpose |
|---|---|---|
| `mirostat` | 0 (off), 1 (v1), 2 (v2) | Variant |
| `mirostat_tau` | 5.0 | Target surprise — lower = tighter |
| `mirostat_eta` | 0.1 | Learning rate |
When to use: if you want consistent quality across very long outputs (e.g., 4K+ token generations) and find that min-p drifts. Otherwise, min-p is simpler and matches mirostat in most short/medium outputs.
When mirostat is on, disable top-k, top-p, min-p, typical-p — mirostat replaces them.
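A sketch of one Mirostat v2 step following the feedback rule above (function name invented; the real algorithm measures surprise in bits, i.e. log base 2, and this simplification omits some bookkeeping):

```python
import math
import random

def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1, rng=random):
    """Drop tokens whose surprisal exceeds mu, sample from the rest,
    then nudge mu toward the target surprise tau."""
    kept = [(i, p) for i, p in enumerate(probs) if -math.log2(p) <= mu] or \
           [max(enumerate(probs), key=lambda t: t[1])]   # always keep argmax
    z = sum(p for _, p in kept)
    r, cum = rng.random() * z, 0.0
    for i, p in kept:                                     # inverse-CDF sample
        cum += p
        if cum >= r:
            break
    surprise = -math.log2(probs[i])
    mu -= eta * (surprise - tau)                          # feedback update
    return i, mu
```

If generated tokens are less surprising than tau, mu rises and the truncation loosens; if they are more surprising, mu falls and it tightens.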
Repetition Penalty {#repetition-penalty}
The original repetition control (llama.cpp, 2023; the idea dates back to the CTRL paper). For any token seen in the last N tokens, divide its logit by the penalty if the logit is positive, or multiply if it is negative — both push the probability down:
logit_i = logit_i / penalty if logit_i > 0 else logit_i * penalty (if token i in last N tokens)
| penalty | Effect |
|---|---|
| 1.0 | Disabled |
| 1.05 | Mild |
| 1.10 | Standard |
| 1.15 | Strong |
| 1.30+ | Causes weirdness — model picks rare synonyms |
Window: typically last 64-256 tokens (repeat_last_n).
Problem: penalizes legitimate repetition (e.g., variable names in code, "the" in prose). DRY is the modern improvement.
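A sketch of the rule (helper name invented), showing why the divide/multiply split matters: dividing a negative logit would raise its probability, so negative logits are multiplied instead.

```python
def apply_repetition_penalty(logits, recent_tokens, penalty):
    """CTRL-style repetition penalty: push down the logit of every
    recently seen token, whatever its sign."""
    out = list(logits)
    for t in set(recent_tokens):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out
```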
Presence and Frequency Penalty {#presence-frequency}
OpenAI-style. Operate on logits directly:
logit_i -= presence_penalty (if token i appeared at all)
logit_i -= frequency_penalty * count(token_i) (proportional to count)
| Parameter | Range | Default |
|---|---|---|
| presence_penalty | -2.0 to 2.0 | 0.0 |
| frequency_penalty | -2.0 to 2.0 | 0.0 |
Use cases: OpenAI-compatible APIs where these are the only repetition controls. For local models with full sampler access, use DRY or repetition_penalty instead — they are more nuanced.
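The two formulas combine into a single per-token subtraction. A sketch (helper name invented):

```python
from collections import Counter

def apply_openai_penalties(logits, generated, presence_penalty, frequency_penalty):
    """OpenAI-style penalties: a flat presence hit plus a per-occurrence
    frequency hit for every token already generated."""
    counts = Counter(generated)
    out = list(logits)
    for t, n in counts.items():
        out[t] -= presence_penalty + frequency_penalty * n
    return out
```

A token generated twice with presence 0.5 and frequency 0.2 loses 0.5 + 0.2 × 2 = 0.9 from its logit.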
DRY — Don't Repeat Yourself {#dry}
DRY (introduced 2024 by p-e-w) penalizes multi-token repeats from the prompt or prior output. Unlike repetition_penalty, it scales the penalty exponentially with match length.
penalty = dry_multiplier * dry_base ** (match_length - dry_allowed_length)
| Parameter | Default | Purpose |
|---|---|---|
| `dry_multiplier` | 0.8 | Overall strength (0 = off) |
| `dry_base` | 1.75 | Exponential base |
| `dry_allowed_length` | 2 | Free repetition under this length |
| `dry_sequence_breakers` | ["\n", ":", "\"", "*"] | Tokens that reset the matcher |
| `dry_penalty_last_n` | 0 (all context) | Window |
Why DRY beats repetition_penalty: it specifically targets phrase-level loops ("I am sorry, but I cannot... I am sorry, but I cannot...") without penalizing common short tokens. Code generation works fine because DRY only kicks in past allowed_length tokens of exact match.
Recommended starting values: the defaults (0.8 / 1.75 / 2) work for most chat workloads. Increase dry_multiplier to 1.0+ for stubborn loops.
Supported in llama.cpp (and Ollama via PARAMETER), KoboldCpp, oobabooga, SillyTavern, Aphrodite Engine.
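The core of DRY is a suffix match: if emitting a candidate token would extend a repeat of an earlier phrase, the penalty grows exponentially with the repeat's length. A simplified sketch (function name invented; the real implementation also handles sequence breakers and uses an efficient matcher):

```python
def dry_penalty(context, candidate, multiplier=0.8, base=1.75, allowed_length=2):
    """Return the penalty to subtract from `candidate`'s logit if emitting it
    would extend a repeat already present in `context` (0.0 otherwise)."""
    best = 0
    for j in range(len(context) - 1, -1, -1):
        if context[j] != candidate:
            continue                              # earlier occurrence of candidate
        # length of the match between the text before position j
        # and the current end of context
        n = 0
        while (n < j and n < len(context) - 1
               and context[j - 1 - n] == context[len(context) - 1 - n]):
            n += 1
        best = max(best, n)
    if best < allowed_length:
        return 0.0                                # short repeats are free
    return multiplier * base ** (best - allowed_length)
```

With context [1, 2, 3, 1, 2] and candidate 3, emitting 3 would repeat the phrase "1 2 3", a match of length 2, so the penalty is 0.8 × 1.75⁰ = 0.8; longer loops are punished exponentially harder.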
XTC — Exclude Top Choices {#xtc}
XTC (2024) intentionally removes the highest-probability tokens at each step with some probability — pushing the model toward less obvious choices.
if random() < xtc_probability:
remove all tokens with prob > xtc_threshold (except the lowest such)
| Parameter | Default | Purpose |
|---|---|---|
| `xtc_threshold` | 0.1 | Tokens above this are candidates for exclusion |
| `xtc_probability` | 0.5 | Chance per step that XTC fires |
Effect: outputs are noticeably more creative, less clichéd. Especially good for fiction, roleplay, and brainstorming.
Do not use for: code, JSON, math, factual answers, exact reproductions. XTC will remove the correct token and force a wrong one.
Supported in oobabooga, SillyTavern, KoboldCpp. Not in vLLM as of 2026.
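The pseudocode above can be fleshed out as follows (helper name invented; the rng parameter exists only to make the sketch testable):

```python
import random

def xtc_filter(probs, threshold=0.1, probability=0.5, rng=random):
    """XTC sketch: with some per-step probability, remove every token above
    `threshold` except the least likely of them, then renormalize."""
    if rng.random() >= probability:
        return list(probs)                  # XTC does not fire this step
    above = [i for i, p in enumerate(probs) if p > threshold]
    if len(above) < 2:
        return list(probs)                  # nothing safe to exclude
    keep_lowest = min(above, key=lambda i: probs[i])
    out = [0.0 if (i in above and i != keep_lowest) else p
           for i, p in enumerate(probs)]
    z = sum(out)
    return [x / z for x in out]
```

Keeping the least likely above-threshold token guarantees at least one "viable" choice survives — the model is pushed toward its second-tier candidates, not into garbage.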
Beam Search and N-Best {#beam-search}
Beam search keeps the top-N partial sequences at every step and picks the best at the end. Quality is often higher (more "thoughtful") but throughput drops linearly with beam width.
| Beam width | Quality | Throughput cost |
|---|---|---|
| 1 | Greedy | 1x |
| 4 | Solid improvement | 4x |
| 8 | Marginal further improvement | 8x |
| 16+ | Diminishing returns | 16x+ |
When to use: translation, summarization, exact-format generation. Avoid for chat (kills latency) and creative writing (produces bland, "safe" outputs).
vLLM supports beam search (older versions via use_beam_search=True on SamplingParams; recent versions moved it to a dedicated beam-search API). llama.cpp removed beam search in 2024 because it was unmaintained.
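The bookkeeping behind beam search is straightforward. A toy sketch (function name invented) over a fixed table of per-step log-probabilities — real decoding is context-dependent, so each beam would get its own forward pass:

```python
import math

def beam_search(step_logprobs, width):
    """Keep the `width` best partial sequences at every step;
    return the best (sequence, score) pair at the end."""
    beams = [((), 0.0)]                       # (token sequence, total log-prob)
    for logprobs in step_logprobs:
        candidates = [(seq + (tok,), score + lp)
                      for seq, score in beams
                      for tok, lp in enumerate(logprobs)]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0]
```

The linear throughput cost in the table above falls out directly: every step expands and scores `width` times as many candidates.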
Greedy and Argmax (Temperature 0) {#greedy}
Greedy decoding always picks the argmax token. Equivalent to temperature 0.
Pros: deterministic, fast, ideal for tests and reproducibility. Cons: accumulates errors over long outputs; produces bland text on creative tasks.
Use for: unit-test fixtures, exact regression baselines, code completion at temperature 0, JSON-mode short outputs.
Avoid for: anything longer than a few sentences of prose, multi-turn chat, or creative work.
Logit Bias and Banned Tokens {#logit-bias}
Manually shift logits for specific tokens.
{ "logit_bias": { "12345": -100, "67890": 5.0 } }
Use cases:
- Ban specific tokens (e.g., model-specific stop tokens that your framework misses).
- Bias towards a format (e.g., favor newline tokens for outline outputs).
- Force JSON characters.
Caveat: values like -100 effectively ban; -5 to +5 are nudges. Tokenize your target string first to find the right token IDs.
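Applying a bias map is a one-liner per token. A sketch (helper name invented) of what the API does to the logit vector:

```python
def apply_logit_bias(logits, bias):
    """Add per-token-ID biases to the logits; -100 effectively bans a token."""
    out = list(logits)
    for token_id, b in bias.items():
        out[token_id] += b
    return out
```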
Constrained Generation: JSON, Grammars {#constrained}
For guaranteed-valid output, constrain the sampler to a grammar.
JSON Schema (vLLM, llama.cpp, Outlines, SGLang)
response_format={
"type": "json_schema",
"json_schema": {
"name": "person",
"schema": {
"type": "object",
"properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
"required": ["name", "age"],
},
},
}
vLLM uses xgrammar (default) or outlines. llama.cpp uses GBNF grammars. Throughput overhead: 5-15%.
GBNF (llama.cpp grammar format)
root ::= object
object ::= "{" pair ("," pair)* "}"
pair ::= string ":" value
value ::= string | number | object | "true" | "false" | "null"
string ::= "\"" [a-zA-Z0-9 ]* "\""
number ::= [0-9]+
./llama-cli -m model.gguf --grammar-file grammar.gbnf -p "Generate a person:"
Constrained sampling beats prompting
Don't beg the model in the prompt to "output JSON" — constrain the sampler. The model can't produce invalid output even if it wanted to, because invalid tokens are masked out.
Recommended Presets by Workload {#presets}
Factual chat / Q&A
temperature: 0.5
min_p: 0.05
top_p: 0.9
top_k: 0
repetition_penalty: 1.05
dry_multiplier: 0.6
dry_allowed_length: 2
Code generation (deterministic)
temperature: 0.0 # or 0.2 for slight diversity
top_p: 1.0
min_p: 0.0
repetition_penalty: 1.0 # off — code legitimately repeats
dry_multiplier: 0.0 # off
Code generation (creative — variable names, comments)
temperature: 0.6
min_p: 0.05
top_p: 0.95
repetition_penalty: 1.0
dry_multiplier: 0.4
dry_allowed_length: 4
RAG / grounded answer
temperature: 0.3
min_p: 0.05
top_p: 0.95
repetition_penalty: 1.05
dry_multiplier: 0.4
Creative writing / fiction
temperature: 1.1
min_p: 0.05
top_p: 0.95
repetition_penalty: 1.05
dry_multiplier: 0.8
dry_base: 1.75
dry_allowed_length: 2
xtc_threshold: 0.1
xtc_probability: 0.5
Roleplay / chat-fiction
temperature: 1.0
min_p: 0.07
top_p: 0.95
repetition_penalty: 1.07
dry_multiplier: 1.0
dry_base: 1.75
dry_allowed_length: 2
xtc_threshold: 0.1
xtc_probability: 0.4
JSON mode / structured output
temperature: 0.0 # or 0.3 with constrained generation
top_p: 1.0
min_p: 0.0
constrained: true (json_schema)
Brainstorming / ideation
temperature: 1.3
min_p: 0.03
top_p: 0.98
repetition_penalty: 1.05
dry_multiplier: 0.6
xtc_threshold: 0.1
xtc_probability: 0.6
How to Set Samplers in Each Framework {#frameworks}
Ollama (Modelfile)
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER min_p 0.05
PARAMETER top_k 0
PARAMETER repeat_penalty 1.05
PARAMETER repeat_last_n 256
PARAMETER mirostat 0
DRY and XTC are not yet exposed in Ollama as of mid-2026; use llama.cpp directly for those. See Ollama Modelfile Guide.
llama.cpp
./llama-cli -m model.gguf \
--temp 0.7 \
--top-p 0.9 \
--min-p 0.05 \
--top-k 0 \
--repeat-penalty 1.05 \
--repeat-last-n 256 \
--dry-multiplier 0.8 \
--dry-base 1.75 \
--dry-allowed-length 2 \
--xtc-probability 0.5 \
--xtc-threshold 0.1
vLLM
from vllm import SamplingParams
sp = SamplingParams(
temperature=0.7,
top_p=0.9,
min_p=0.05,
top_k=-1, # -1 means disabled
presence_penalty=0.0,
frequency_penalty=0.0,
repetition_penalty=1.05,
max_tokens=2048,
)
DRY and XTC are not in vLLM as of 2026 — file an issue or use Aphrodite Engine for both.
OpenAI-compatible HTTP
{
"model": "...",
"messages": [...],
"temperature": 0.7,
"top_p": 0.9,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"max_tokens": 2048
}
Min-p, DRY, and XTC are not part of the OpenAI spec — you need a non-OpenAI client or vendor extension fields to set them.
KoboldCpp / oobabooga
Both expose every sampler in their UI. KoboldCpp's "preset" dropdown ships with curated presets (Balanced, Creative, Precise) that are good starting points.
SillyTavern
The "Sampler" panel in SillyTavern is the most complete UI for sampling — exposes every sampler from llama.cpp, KoboldCpp, and oobabooga, plus saves presets per character. Best frontend for tuning.
Debugging Sampling Issues {#debugging}
| Symptom | Likely Cause | Fix |
|---|---|---|
| Output loops forever ("I cannot... I cannot...") | No DRY, low repetition_penalty | DRY 0.8/1.75/2 or rep_penalty 1.10 |
| Output is bland / repetitive across runs | Temperature too low, no min-p | T=0.8, min-p 0.05 |
| Output is incoherent / random | Temperature too high without truncation | Add min-p 0.05 |
| Code has wrong syntax | Temperature > 0 or XTC enabled | T=0, disable XTC |
| JSON not valid | No constrained generation | Use json_schema or GBNF |
| Model picks rare synonyms | repetition_penalty too high | Drop to 1.0-1.05 |
| Output cuts off short | Stop tokens or max_tokens hit | Check stop array, raise max_tokens |
| Different output every run despite T=0 | Non-deterministic kernels | Set seed and torch.use_deterministic_algorithms(True) |
Sources: llama.cpp sampling source | DRY paper / discussion | XTC discussion | Min-P paper | Mirostat paper (Basu et al., 2020) | Nucleus Sampling paper (Holtzman et al., 2019) | Internal testing on Llama 3.1, Qwen 2.5, Mistral models.