Reference · Free · 83 Terms
AI Glossary
Plain-English definitions for the 83 terms you actually need when working with local AI in 2026. Every term links to related concepts and to the full guide where one exists. Search any term, or browse by category — Models, Hardware, Inference, Architecture, Training, Frameworks, Benchmarks, Applications.
AI Agent
Applications · LLM system that uses tools and takes multi-step actions toward a goal.
AI agents run a loop: the model proposes an action (call a tool, write code, browse the web), the runtime executes it, the result feeds back into the model's context, and the loop repeats until the goal is achieved. Top 2026 agent frameworks: Anthropic Claude Code, OpenAI Codex CLI, Cursor Composer, browser-use, AutoGPT successors. Strong agents need tool use, function calling, planning, and reliable termination.
AIME
Benchmarks · American Invitational Mathematics Examination — 15 hard math problems requiring multi-step reasoning.
AIME 2025 contains 15 problems from the American Invitational Mathematics Examination, each requiring multi-step competition-level math. Top 2026 models solve 90%+; AIME replaced GSM8K, which every top model had saturated. Strong reasoning signal for math, science, and structured problem-solving capability.
AMD MI300X
Hardware · AMD's datacenter LLM accelerator — 192GB HBM3, beats H100 on inference per dollar.
The MI300X is AMD's 2024 LLM accelerator with 192GB HBM3 and 5.3 TB/s bandwidth. The 192GB capacity lets it serve a Llama 3.1 405B model on 4 cards vs 8x H100. Software has been the bottleneck — ROCm 6.x and vLLM AMD support are now production-viable.
Apple Silicon
Hardware · Apple's M-series chips with unified memory architecture, well-suited for medium-large LLMs.
Apple Silicon (M1/M2/M3/M4 + Pro/Max/Ultra) uses unified memory accessible by both CPU and GPU. M3 Ultra reaches up to 512GB unified at 819 GB/s — the cheapest path to running a 70B model on a single quiet desktop. Ideal for inference, less ideal for training due to lower compute throughput than discrete GPUs.
ARC-AGI / ARC-AGI-2
Benchmarks · Abstract reasoning puzzles that resist pattern memorization — measures novel problem solving.
ARC (Chollet, 2019) tests abstract visual reasoning on novel puzzles humans solve at ~85%. Until 2024 every LLM scored under 30% — the canonical "AGI gap" benchmark. ARC-AGI-2 (2025-2026) is a harder revision. The 2026 jump to 70%+ from frontier models marks meaningful generalization progress.
Attention
Architecture · Mechanism letting tokens "look at" other tokens in the sequence to compute their representations.
Attention computes a weighted sum of value vectors, where weights come from query-key dot products. Self-attention (queries, keys, values all from the same sequence) is the core operation in transformers. Variants: multi-head attention (multiple parallel attention heads), grouped-query attention (GQA — share keys/values across heads), multi-head latent attention (MLA — compress K/V into a low-rank latent).
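A minimal sketch of that computation, single head, in plain NumPy — the random projection matrices stand in for learned weights and the shapes are illustrative, not from any real model:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # project to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])   # scaled query-key dot products
    weights = softmax(scores, axis=-1)        # attention weights per token
    return weights @ v                        # weighted sum of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))                  # 8 tokens, d_model = 64
Wq, Wk, Wv = (rng.normal(size=(64, 64)) * 0.1 for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)           # shape (8, 64)
```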
Automatic1111
Frameworks · The original Stable Diffusion web UI — tabbed interface with the largest extension ecosystem.
Automatic1111 (A1111) is the original SD web UI, with a tabbed interface (txt2img / img2img / extras) and 5,000+ extensions. Best for SDXL / SD 1.5 / Pony work and for users who want a familiar tabbed workflow over node graphs. SD Forge is a faster A1111 fork; ComfyUI is the more flexible alternative.
AWQ
Inference · Activation-aware Weight Quantization — INT4 quantization optimized for vLLM/SGLang serving.
AWQ (Activation-aware Weight Quantization) protects the 1% of weights most critical for activation quality, quantizing the rest to INT4. Used by vLLM, SGLang, and TensorRT-LLM for production serving. Slightly lower quality than Q4_K_M but much faster on GPU due to optimized CUDA kernels.
Axolotl
Frameworks · YAML-driven LLM fine-tuning framework — wraps Hugging Face TRL with multi-GPU and tested recipes.
Axolotl is a YAML-config fine-tuning toolkit built on Hugging Face TRL. Supports full fine-tuning, LoRA, QLoRA, multi-node distributed training, and includes vetted configs for popular base models. Standard for serious open-weight fine-tuning at the 7B-70B scale.
BF16 (BFloat16)
Hardware · Brain Float 16-bit format — the default training and inference dtype for modern LLMs.
BF16 keeps the same 8-bit exponent as FP32 but truncates the mantissa to 7 bits. The wide exponent makes it numerically stable for training without scaling tricks. Every major frontier LLM in 2024-2026 trains in BF16; inference often quantizes further.
BPE (Byte-Pair Encoding)
Training · Tokenization algorithm that iteratively merges the most frequent character pairs.
BPE (Sennrich et al., 2016) starts from individual characters (or bytes, in byte-level BPE) and iteratively merges the most frequent adjacent pair into a new token. Merging continues until the vocabulary reaches the target size. Used by GPT, Llama, Qwen, Mistral, and most modern LLMs. SentencePiece BPE (Google) handles multilingual text without explicit pre-tokenization.
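A toy version of the merge loop, assuming a tiny word list — real tokenizers work at the byte level and are far more optimized, but the core idea is just this:

```python
from collections import Counter

def train_bpe(words, num_merges):
    corpus = [list(w) for w in words]         # each word as a symbol sequence
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))   # count adjacent symbol pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent pair wins
        merges.append((a, b))
        for seq in corpus:                    # replace the pair with a merged symbol
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

print(train_bpe(["lower", "lowest", "newer", "wider"], num_merges=5))
```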
Chain of Thought (CoT)
Applications · Prompting / training pattern where the model writes intermediate reasoning steps before the answer.
Chain of Thought (Wei et al., 2022) encourages the model to "think step by step" — producing intermediate reasoning before the final answer. Original CoT was prompt-based; modern reasoning models (OpenAI o1, DeepSeek R1, Claude's Adaptive Thinking) explicitly train on long internal CoT during RL.
Claude
Models · Anthropic's family of models — known for safety, long context, and strong coding.
Claude is Anthropic's LLM family. Notable: Claude 3 (Mar 2024), Claude 3.5 Sonnet (Jun 2024), Claude 4 (May 2025), Claude 4.5 (Sep 2025), Claude Sonnet 5 (Apr 2026), Claude Opus 4.7 (Apr 2026). Distinguished by Constitutional AI alignment, Adaptive Thinking on Opus 4.7, and the strongest agentic coding scores in 2026.
ComfyUI
Frameworks · Node-graph UI for Stable Diffusion / Flux / video — flexible alternative to Automatic1111.
ComfyUI is a node-graph-based UI for diffusion models. Each node is an operation (load checkpoint, encode prompt, sample, decode); workflows are graphs you can save, share, and version. The most flexible local image-gen UI and the only one with first-class Flux Dev, Wan video, and HunyuanVideo support.
Context Window
Inference · Maximum number of tokens a model can attend to at once.
Context window is the cap on input + output token count for a single forward pass. 2026 frontier ranges: GPT-5.5 400K, Claude Sonnet 5 1M, Gemini 3.1 Pro 1M, DeepSeek V4 256K, Qwen3-Coder-Next 256K. Long context is bound by KV cache memory and "lost in the middle" recall degradation. Effective context (where the model still recalls accurately) is often half the nominal window.
CUDA
Hardware · NVIDIA's parallel computing platform — the de facto standard for GPU AI workloads.
CUDA (Compute Unified Device Architecture) is NVIDIA's programming model for GPUs. Every major LLM framework — PyTorch, vLLM, TensorRT-LLM, Triton — runs on CUDA. CUDA's ecosystem moat is the primary reason NVIDIA dominates AI hardware.
DeepSeek
Models · Chinese AI lab with state-of-the-art open-weight models — V3, R1, V4 family.
DeepSeek is a Chinese AI lab whose open-weight releases have set the open-model frontier. Notable: V2 (May 2024 — introduced MLA), V3 (Dec 2024 — first GPT-4o-class open model), R1 (Jan 2025 — first reasoning model open release), V4 (Apr 2026 — current open frontier). MIT-licensed weights — among the most permissive frontier-class licenses available.
Dense Model
Models · A model where every parameter is used for every token (opposite of MoE).
Dense models activate all their parameters on every forward pass. Examples in 2026: Qwen3.6-27B, Llama 4 70B, Mistral Medium 3.5. Pros: simpler training, easier to fine-tune, lower total VRAM. Cons: harder to scale to trillion-parameter regimes without proportional compute increases.
DPO (Direct Preference Optimization)
Training · Preference learning method that optimizes directly on chosen vs rejected pairs without a reward model.
DPO (Rafailov et al., 2023) reformulates RLHF as a simple binary classification on (prompt, chosen, rejected) triples. No separate reward model, no PPO, no rollouts. Easier to train and roughly matches RLHF quality on most preference tasks. The default preference-learning method in 2024-2026.
EAGLE
Inference · A speculative decoding method that uses the target model's second-to-last layer as the draft.
EAGLE replaces the small standalone draft with a single transformer layer that runs on the target's penultimate hidden states. Higher acceptance rate (70-85%) and lower memory overhead than draft-target speculative decoding. EAGLE-2 (2024) adds dynamic tree attention; EAGLE-3 (2025) adds training-aware draft adaptation.
Embedding Model
Models · A model that converts text or images into fixed-length numeric vectors.
Embedding models map inputs into a vector space where semantic similarity equals geometric proximity. Critical for RAG, semantic search, and clustering. Top 2026: OpenAI text-embedding-3, Cohere embed-v4, BGE-M3, Voyage-3, Jina-embeddings-v3.
Fine-Tuning
Training · Training a pretrained model further on domain-specific or task-specific data.
Fine-tuning takes a pretrained base model and continues training on smaller, specialized datasets. Methods: full fine-tuning (update all weights — expensive), LoRA / QLoRA (update small adapter matrices — cheap), prompt tuning (update only embedding-space prompts). Use cases: domain specialization, style transfer, instruction following, safety alignment.
FlashAttention
Inference · IO-aware exact-attention algorithm that runs attention in O(N) memory instead of O(N²).
FlashAttention (Dao et al., 2022) tiles the attention computation and keeps intermediate values in GPU SRAM, avoiding materialization of the N×N attention matrix in HBM. Same math as vanilla attention, but 2-4× faster on long context. FA-2 (2023) and FA-3 (2024 Hopper-specific) further improve throughput.
FP8
Hardware · 8-bit floating point format — halves memory and doubles throughput vs BF16 for inference.
FP8 stores tensors in 8 bits with two variants (E4M3 / E5M2 — different exponent/mantissa splits). H100, H200, MI300X, and RTX 5090 all have native FP8 tensor cores. For LLM inference, FP8 typically loses <1 quality point vs BF16 while doubling effective throughput and halving memory.
Function Calling / Tool Use
Applications · LLM capability to invoke external functions/APIs in structured form.
Function calling (also called tool use) lets an LLM emit a structured request to call an external function — typically as a JSON object with function name and arguments. The application executes the function and returns the result to the model. The foundation under all modern agent frameworks.
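A minimal sketch of that loop. The exact wire format differs by provider (OpenAI, Anthropic, local runtimes); `get_weather` and the JSON shape here are illustrative stand-ins, not any specific API:

```python
import json

def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21, "conditions": "clear"}  # stub tool

TOOLS = {"get_weather": get_weather}

# what a model's tool-call output typically looks like (illustrative only)
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])   # the app executes the tool
tool_message = {"role": "tool", "content": json.dumps(result)}
print(tool_message)   # appended to the conversation for the next model turn
```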
GDDR
Hardware · Graphics DDR — the memory used in consumer GPUs (RTX 30/40/50 series).
GDDR6 and GDDR6X power the consumer NVIDIA stack. RTX 4090 uses GDDR6X at 1008 GB/s — about 1/3 the bandwidth of H100's HBM3. RTX 5090 uses GDDR7, pushing closer to 1.8 TB/s but still well below datacenter HBM.
Gemini
Models · Google DeepMind's family of models — strong on long context and multimodal.
Gemini is Google DeepMind's LLM family. Notable: Gemini 1.0 (Dec 2023), 1.5 Pro (Feb 2024 — first 1M context), 2.0 Flash (Dec 2024), 2.5 Pro (Mar 2025), 3.1 Pro (Feb 2026). Distinguished by 1M+ context window, native multimodality (vision + audio), and Deep Think mode for high-reasoning problems.
GGUF
Inference · Single-file model format used by llama.cpp and Ollama — packs weights, tokenizer, and metadata.
GGUF (GPT-Generated Unified Format) replaces the older GGML format. A GGUF file contains the full quantized weights, tokenizer, prompt template, and architecture metadata in one self-contained binary. Used by llama.cpp, Ollama, KoboldCpp, LM Studio, jan, and most consumer-facing local LLM tools.
GPT
Models · OpenAI's family of models — Generative Pretrained Transformer.
GPT (Generative Pretrained Transformer) is OpenAI's LLM family. Notable releases: GPT-3 (2020, 175B), GPT-4 (2023), GPT-4o (2024), GPT-4.5 (Feb 2025), GPT-5 (2025), GPT-5.5 (May 2026). All are closed-source, accessed via the OpenAI API or ChatGPT.
GPTQ
Inference · Layer-by-layer INT4 quantization that re-optimizes remaining weights to compensate for quantization error.
GPTQ (Generative Pretrained Transformer Quantization) quantizes one layer at a time, using calibration data to update remaining FP weights to compensate for the introduced error. Works well for INT4 and INT3. Largely superseded by AWQ for production serving but still common in the open-weight community.
GQA (Grouped Query Attention)
Architecture · Attention variant where multiple query heads share a single key/value head.
GQA reduces KV cache memory by letting groups of query heads share a single K and V head. A typical configuration (32 query heads sharing 4 KV heads) cuts the KV cache 8× with minimal quality loss. Used in Llama 3+, Qwen 2+, Mistral, and most 2024+ open-weight models.
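A sketch of how those shared KV heads get expanded at compute time, with illustrative head counts — only the 4 K/V heads ever live in the cache:

```python
import torch

batch, seq, n_q_heads, n_kv_heads, head_dim = 1, 16, 32, 4, 64
q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # only 4 heads are cached
v = torch.randn(batch, n_kv_heads, seq, head_dim)

group = n_q_heads // n_kv_heads                      # 8 query heads per KV head
k = k.repeat_interleave(group, dim=1)                # expand to (1, 32, 16, 64)
v = v.repeat_interleave(group, dim=1)

out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([1, 32, 16, 64])
```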
GSM8K
Benchmarks · Grade-school math word problems — saturated by 2024.
GSM8K (OpenAI, 2021) contains 8.5K grade-school math word problems. Was the standard math benchmark from 2021-2023. By 2024 every frontier model scored 95%+. Replaced by AIME 2025 and MATH-500 for serious benchmarking.
HBM (High Bandwidth Memory)
Hardware · Stacked memory technology used in datacenter GPUs (H100, MI300X) for very high bandwidth.
HBM (High Bandwidth Memory) is 3D-stacked DRAM connected to the GPU die via a silicon interposer. HBM3 reaches ~3.35 TB/s on the H100 80GB; HBM3e on the H200 reaches ~4.8 TB/s. Critical for LLM inference where decode is bandwidth-bound.
Hugging Face Transformers
Frameworks · PyTorch-based reference library for LLMs — the de facto standard for research and prototyping.
The Hugging Face Transformers library wraps thousands of models in a uniform PyTorch API. Used by researchers, model creators, and as the underlying training stack for most fine-tuning. Slower than vLLM/SGLang for serving but the most flexible for research and one-off experimentation.
HumanEval / EvalPlus
Benchmarks · Python programming benchmark — 164 hand-written problems with unit tests.
HumanEval (Chen et al., 2021) has 164 hand-written Python programming problems and unit tests. Top 2026 models score 95%+ — saturated. EvalPlus (HumanEval+) adds 80× more test cases per problem to catch models that pass weak tests but produce buggy code. Less discriminative than SWE-Bench but useful as a sanity check.
JSON Mode / Constrained Generation
Applications · Forcing the model to output valid JSON / regex / grammar via constrained sampling.
JSON Mode masks the LLM's output distribution at each step to only sample tokens that maintain a valid JSON parse (or any other grammar). Implementations: outlines, Guidance, llama.cpp grammars, vLLM's structured output, OpenAI/Anthropic native JSON modes. Critical for reliable function calling and structured data extraction.
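The core mechanism reduced to its simplest form: before sampling each token, push the logits of every illegal token to minus infinity. Real libraries derive the legal-token set from a JSON Schema or regex automaton; here the "grammar" is just a hypothetical whitelist of digit token IDs:

```python
import numpy as np

def constrained_sample(logits, legal_token_ids, rng):
    masked = np.full_like(logits, -np.inf)
    masked[legal_token_ids] = logits[legal_token_ids]   # keep only legal tokens
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)             # fake vocabulary of 1,000 tokens
digit_ids = list(range(10))                # pretend ids 0-9 are the digit tokens
print(constrained_sample(logits, digit_ids, rng))   # always one of 0-9
```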
Knowledge Distillation
Training · Training a small "student" model to mimic a large "teacher" model's outputs.
Distillation compresses a strong teacher into a smaller student. Methods: sequence-level (student trains on teacher's text outputs), logit-level (student matches teacher's probability distribution), hidden-state (student matches teacher's intermediate representations), rationale (student trains on teacher's chain-of-thought). DeepSeek R1-Distill, Qwen-Distill, and most "small but smart" 2024+ open models are distilled.
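A sketch of the logit-level variant: push the student's softened token distribution toward the teacher's with a KL term, mixed with ordinary cross-entropy. Temperature T and mixing weight alpha are typical hyperparameters, not values from any specific paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # KL term pulling the student's softened distribution toward the teacher's
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)   # standard next-token loss
    return alpha * kd + (1 - alpha) * ce

student = torch.randn(4, 32000, requires_grad=True)   # (batch, vocab)
teacher = torch.randn(4, 32000)
labels = torch.randint(0, 32000, (4,))
print(distillation_loss(student, teacher, labels))
```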
KV Cache
Inference · Cached Key/Value tensors from attention — the dominant memory cost in long-context inference.
During autoregressive generation, attention K and V tensors for past tokens are cached to avoid recomputation. KV cache memory: 2 × layers × heads × head_dim × bytes × seq_len per request. Often larger than the model weights themselves at long context. Optimization techniques: PagedAttention, GQA, MLA, KV quantization (FP8 / INT4).
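Plugging that formula into a quick calculation — the numbers below are an illustrative 80-layer model with GQA (8 KV heads), not any specific release:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # factor of 2 = one K tensor + one V tensor per layer
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

gb = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128_000) / 1e9
print(f"{gb:.1f} GB per request")   # ~41.9 GB at BF16 for a 128K-token request
```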
Llama
Models · Meta's open-weight LLM family — the most-deployed open model in production.
Llama is Meta's LLM family. Notable: Llama 1 (Feb 2023, research-only), Llama 2 (Jul 2023, first commercially-permissive), Llama 3 (Apr 2024), Llama 3.1 (Jul 2024 — 405B flagship), Llama 3.2 (Sep 2024 — multimodal), Llama 4 (Mar 2026). Llama Community License allows commercial use under a 700M MAU threshold.
llama.cpp
Frameworks · C++ implementation of LLM inference — runs everywhere, supports GGUF, the foundation under Ollama and LM Studio.
llama.cpp (Georgi Gerganov) is the canonical CPU/GPU LLM inference library. Pure C++, no PyTorch dependency, runs on Linux/Mac/Windows/Android/iOS, supports CUDA / Metal / ROCm / Vulkan, and uses the GGUF format. The foundation under Ollama, KoboldCpp, LM Studio, jan, and most consumer-facing local AI tools.
LLM
Models · Large Language Model. A neural network trained on massive text corpora to predict tokens.
Large Language Models are transformer-based neural networks with billions to trillions of parameters trained on text to predict the next token. The largest 2026 models (DeepSeek V4, Claude Opus 4.7, Gemini 3.1 Pro) sit at hundreds of billions of activated parameters with trillions of total parameters in MoE configurations. LLM capabilities scale roughly with the log of compute spent on training and the quality of training data.
LoRA (Low-Rank Adaptation)
Training · Fine-tuning method that updates small low-rank matrices instead of full weights.
LoRA (Hu et al., 2021) freezes pretrained weights W and adds a low-rank update W + ΔW where ΔW = BA, with B and A small matrices (rank typically 8-64). Trains 0.1-1% of full parameters with quality close to full fine-tuning on most tasks. Adapters merge cleanly back into weights at inference time.
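A minimal sketch of that ΔW = BA idea as a wrapper around one frozen linear layer (not the PEFT API; rank and alpha are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=16, alpha=32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                 # freeze pretrained W
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # y = x W^T + scale * (x A^T) B^T — only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), rank=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 131,072 LoRA params vs ~16.8M in the frozen base weight
```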
Mamba / SSM
Architecture · State-Space Model architecture with linear-time sequence processing — alternative to Transformer attention.
Mamba (Gu & Dao, 2023) is a Selective State-Space Model that processes sequences in O(N) time by encoding all history into a recurrent state. Faster at very long context than Transformers, but slightly weaker on tasks requiring exact recall. Hybrid Mamba+Transformer architectures (e.g., Jamba, Zamba) show promise.
Medusa
Inference · A speculative decoding method that adds parallel "head" prediction layers to the target model.
Medusa attaches multiple parallel prediction heads to the target model's last hidden state, each predicting the K-th future token directly. Unlike EAGLE's sequential drafts, Medusa's heads predict in parallel. Acceptance rate is lower than EAGLE but inference is simpler.
Metal
Hardware · Apple's GPU framework used for AI on Mac M-series chips.
Metal is Apple's low-level GPU API. MLX, llama.cpp, and Ollama use Metal to run LLMs on Apple Silicon. With unified memory architecture, Macs avoid the VRAM/system-RAM split — a Mac Studio with 192GB can serve a 70B model that no single discrete GPU can.
Mistral
Models · French AI lab — open-weight 7B/12B/123B and Mistral Medium 3.5 unified vision/coding model.
Mistral AI is a French AI lab founded in 2023. Notable releases: Mistral 7B (Sep 2023 — pioneered sliding-window attention in open weights), Mixtral 8x7B (Dec 2023 — first major open MoE), Mistral Large 2 (Jul 2024), Mistral Medium 3.5 (Apr 2026 — unified Magistral reasoning + Pixtral vision + Devstral coding). Mix of Apache 2.0 and Mistral Research licenses depending on model.
MLA (Multi-Head Latent Attention)
Architecture · DeepSeek's attention variant that compresses K/V into a low-rank latent representation.
Multi-Head Latent Attention (DeepSeek-V2, 2024) projects K and V into a small latent space, storing only the latent in cache. Achieves 5-10× KV cache reduction vs MHA — competitive with GQA at smaller compression but more flexible. Used in DeepSeek V2, V3, V4 and several 2025+ derivatives.
MLX
Frameworks · Apple's ML framework optimized for Apple Silicon — alternative to PyTorch on Mac.
MLX is Apple's 2023-released ML framework with a NumPy-like API, lazy evaluation, and unified-memory awareness. MLX-LM lets you run 7B-70B models on Apple Silicon at competitive throughput vs llama.cpp Metal. Best for users who want PyTorch-like APIs without leaving the Mac ecosystem.
MMBench / MMMU / MMVet
Benchmarks · Vision-language benchmarks measuring multimodal reasoning across image+text tasks.
MMBench (multi-choice perception + reasoning), MMMU (college-level multimodal QA), MMVet (open-ended response evaluation) are the three standard VLM benchmarks. Top 2026 VLMs (Claude Opus 4.7, Gemini 3.1 Pro, GLM-4.5V, Qwen2-VL 72B) cluster in the 80-90% range across these.
MMLU / MMLU-Pro
Benchmarks · Massive Multitask Language Understanding — multi-choice questions across 57+ disciplines.
MMLU (Hendrycks et al., 2020) covers 57 academic disciplines with 4-choice questions. By 2024 every frontier model scored 85%+ — saturated. MMLU-Pro (TIGER-Lab, 2024) extends with 14 disciplines, 12K questions, and 10-choice answers, restoring discriminative power. The default reasoning benchmark in 2025-2026.
MoE (Mixture of Experts)
Models · Architecture where each token activates only a small fraction of model parameters via a gating network.
Mixture of Experts splits a model into many "expert" subnetworks (typically 64-256 experts) and routes each token to a small subset (usually 8) via a learned gating network. Total parameters can be huge (DeepSeek V4: 1.6T) while activated parameters per token stay small (49B), giving the quality of a large dense model at the inference cost of a smaller one. Trade-offs: more total VRAM, harder to fine-tune, requires balanced routing during training.
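A sketch of the routing step with a toy 8-expert layer and top-2 selection (expert count and k are illustrative; production routers add load balancing and run experts in batched kernels, not a Python loop):

```python
import torch
import torch.nn.functional as F

n_experts, k, d = 8, 2, 64
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
gate = torch.nn.Linear(d, n_experts)

x = torch.randn(16, d)                              # 16 tokens
scores = F.softmax(gate(x), dim=-1)                 # (16, 8) routing weights
topk_w, topk_idx = scores.topk(k, dim=-1)           # each token picks 2 experts

out = torch.zeros_like(x)
for token in range(x.shape[0]):
    for w, idx in zip(topk_w[token], topk_idx[token]):
        out[token] += w * experts[idx](x[token])    # weighted expert outputs
print(out.shape)   # torch.Size([16, 64])
```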
Multimodal Model
Models · A model that handles multiple input types — text, images, audio, video.
Multimodal models accept and reason about more than one modality. Beyond VLMs, modern multimodal includes audio-language models (text + speech) and any-to-any models like GPT-4o that handle text, images, and audio in unified representations.
NVIDIA H100
Hardware · Datacenter GPU with 80GB HBM3 — workhorse of 2024-2025 LLM training and inference.
The H100 (Hopper) launched in 2022 with 80GB HBM3 (3.35 TB/s), FP8 tensor cores, and the Transformer Engine. From 2023 to 2025 it was the dominant LLM accelerator. Succeeded by H200 (141GB HBM3e) and Blackwell (B100/B200) in 2024-2026.
Ollama
Frameworks · CLI and runtime for running LLMs locally — wraps llama.cpp with a model registry.
Ollama is the most popular local LLM runtime in 2024-2026. It wraps llama.cpp with a Docker-style model registry (`ollama pull llama3`), an OpenAI-compatible API server, and Modelfile-based custom prompts. Default for hobbyist and developer use cases. Less efficient than vLLM for production serving with concurrent users.
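A minimal sketch of talking to a local Ollama server from Python — it assumes the default localhost:11434 endpoint, a model already pulled with `ollama pull llama3`, and the generate endpoint's response field; check Ollama's docs for the current API surface:

```python
import json
import urllib.request

payload = {"model": "llama3",
           "prompt": "Explain KV cache in one sentence.",
           "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:          # requires a running server
    print(json.loads(resp.read())["response"])
```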
PagedAttention
Inference · KV cache management technique that uses fixed-size pages instead of contiguous buffers.
PagedAttention (Kwon et al., 2023, the vLLM paper) allocates KV cache in fixed-size pages (typically 16 tokens) rather than max-sequence buffers. Cuts memory waste by 60-90%, enables 2-4× larger batch sizes, and naturally supports prefix caching, beam search, and dynamic eviction. Default in vLLM, SGLang, and TensorRT-LLM.
Q4_K_M
Inference · The most popular llama.cpp quantization — 4-bit weights with K-quants and Mixed precision.
Q4_K_M is a llama.cpp quantization format that uses 4-bit weight encoding ("K-quants" — block-quantized with importance weighting) plus mixed precision for sensitive layers (attention K/V matrices stay in 5-6 bits). For most models it's the sweet spot: ~4.5 bits per weight on average, <1% MMLU loss vs BF16, 4× memory reduction.
QLoRA
Training · LoRA training on top of a 4-bit quantized base model — fits 70B fine-tuning on a 24GB GPU.
QLoRA (Dettmers et al., 2023) loads the frozen base model in NF4 quantization (4-bit), then trains LoRA adapters on top in BF16. Cuts training memory by 4× vs full BF16 LoRA, making 70B fine-tuning fit on a single 24GB consumer GPU. Standard in Unsloth, Axolotl, and Hugging Face PEFT.
Quantization
Inference · Reducing the bit-width of model weights to save memory and increase throughput.
Quantization compresses model weights from BF16 (16 bits) to lower-bit formats: INT8 (-50% memory), Q4 / INT4 (-75%), Q3 (-81%), Q2 (-87.5%). Modern Q4_K_M (llama.cpp) loses <1 MMLU point vs BF16 on 70B models. Aggressive quantization (Q2-Q3) starts to noticeably degrade quality on smaller models.
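A toy symmetric 4-bit quantizer to show the round-to-grid / rescale idea and the error it introduces — real schemes (Q4_K_M, AWQ, GPTQ) add per-block importance weighting and other tricks on top:

```python
import numpy as np

def quantize_int4(w, block=64):
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7     # int4 grid: -8..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)             # one fake weight tensor
q, scale = quantize_int4(w)
error = np.abs(w - dequantize(q, scale)).mean()
print(f"mean abs error: {error:.4f}")                    # the quality cost of 4-bit
```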
Qwen
Models · Alibaba's open-weight LLM family — strong on coding and multilingual.
Qwen is Alibaba's LLM family. Notable: Qwen 2 (Jun 2024), Qwen 2.5 (Sep 2024), Qwen 3 (Apr 2025), Qwen3-Coder-Next (Apr 2026 — current best open coder), Qwen3.6-27B (Apr 2026 — best dense). Apache 2.0 licensed, with full commercial rights.
RAG (Retrieval-Augmented Generation)
Applications · Pattern: retrieve relevant documents, then condition the LLM on them.
RAG (Lewis et al., 2020) augments LLM generation with retrieval over an external knowledge base. Standard pipeline: chunk documents → embed chunks → retrieve top-K by query similarity → optionally rerank → feed retrieved context to LLM. The dominant pattern for "talking to your documents" — used by every major enterprise AI deployment in 2024-2026.
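The retrieval half of that pipeline in miniature. `embed` below is a random stand-in for a real embedding model (e.g. BGE-M3 via sentence-transformers), so the scores are meaningless — the point is the chunk → embed → top-k → prompt flow:

```python
import numpy as np

def embed(texts):                       # placeholder: fake unit vectors
    rng = np.random.default_rng(abs(hash(tuple(texts))) % 2**32)
    v = rng.normal(size=(len(texts), 384))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

chunks = ["GGUF packs weights and tokenizer in one file.",
          "PagedAttention allocates KV cache in fixed-size pages.",
          "LoRA trains small low-rank adapter matrices."]
query = "How does vLLM manage KV cache memory?"

chunk_vecs, query_vec = embed(chunks), embed([query])[0]
scores = chunk_vecs @ query_vec                       # cosine similarity
top_k = np.argsort(scores)[::-1][:2]                  # retrieve top 2 chunks
prompt = "Context:\n" + "\n".join(chunks[i] for i in top_k) + f"\n\nQ: {query}"
print(prompt)                                         # fed to the LLM
```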
Reasoning Model
Applications · Models trained to spend extra compute on internal "thinking" before answering.
Reasoning models (OpenAI o1/o3, DeepSeek R1, Claude with Adaptive Thinking, Gemini 3.1 Pro Deep Think) trade latency for quality on hard problems by generating long internal chain-of-thought before the user-facing answer. Can spend seconds to minutes "thinking" per query. Excel at math, code, and planning; overkill for chitchat.
Reranker
Models · A cross-encoder model that scores query-document relevance more accurately than embedding similarity.
Rerankers take a (query, document) pair as joint input to a transformer and produce a relevance score. Slower than embedding similarity but much more accurate. Standard RAG pattern: embedding model retrieves top 50, reranker re-scores to top 5. Top 2026: BGE-Reranker-v2, Cohere Rerank 3, Jina Reranker v2.
RLAIF (RL from AI Feedback)
Training · RLHF variant using a strong AI model instead of humans to generate preference labels.
RLAIF replaces human raters with a strong AI judge (typically GPT-4-class). Cheaper and scales better than human feedback, while preserving most of the alignment quality. The Constitutional AI approach (Anthropic) is a structured form of RLAIF.
RLHF (RL from Human Feedback)
Training · Training a reward model from human preferences, then using RL to optimize the LLM against it.
RLHF (Christiano et al., 2017; OpenAI 2022 InstructGPT) trains a reward model on human preference data, then uses PPO (a reinforcement learning algorithm) to fine-tune the LLM to maximize the reward. Computationally expensive but produces highly aligned models. Largely replaced by DPO in 2024 for most teams; still used at the frontier.
ROCm
Hardware · AMD's GPU compute platform — the CUDA equivalent for AMD GPUs.
ROCm (Radeon Open Compute) is AMD's open-source GPU compute stack. ROCm 6.x supports MI300X, MI325X, RX 7900 XTX, and Strix Halo iGPUs. PyTorch, vLLM, and llama.cpp all have ROCm builds. Performance and software maturity have closed most of the gap with CUDA in 2025-2026.
RoPE (Rotary Position Embedding)
Architecture · Position encoding that rotates query/key vectors by frequency-based angles.
RoPE (Su et al., 2021) encodes position by rotating Q and K vectors in 2D subspaces by angles proportional to position. Has better extrapolation properties than absolute position embeddings and is the de facto standard in modern LLMs. NTK-aware scaling and YaRN extend RoPE to longer contexts than training.
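A sketch of RoPE applied to one head, using the split-half pairing common in open-weight implementations; base=10000 is the original paper's value, the other dims are illustrative:

```python
import numpy as np

def rope(x, base=10000.0):
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)          # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)       # angle grows with position
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                  # each (x1_i, x2_i) is a 2D pair
    return np.concatenate([x1 * cos - x2 * sin,        # 2D rotation of every pair
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=(16, 64))     # 16 positions, head_dim 64
print(rope(q).shape)   # (16, 64)
```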
RTX 4090
Hardware · NVIDIA's 2022 consumer flagship — 24GB GDDR6X, still excellent for local LLMs.
The RTX 4090 (Ada Lovelace) shipped with 24GB GDDR6X at 1008 GB/s — enough VRAM for 7B-13B models in BF16, 30B in Q4, or 70B with aggressive offload. Best price/performance consumer GPU for local LLMs through 2024-2025.
RTX 5090
Hardware · NVIDIA's 2025 consumer flagship — 32GB GDDR7, top-tier for local AI.
The RTX 5090 (Blackwell consumer) launched January 2025 with 32GB GDDR7, ~1.8 TB/s bandwidth, and FP4/FP8 support via 5th-gen tensor cores. The first consumer card with enough VRAM to run a quantized 32B model entirely on-GPU with room for context.
Sampling Parameters
Inference · Parameters controlling token selection from the model's probability distribution.
Common sampling parameters: temperature (scales logits — higher = more random), top-k (restrict to top K tokens), top-p (nucleus — restrict to smallest set summing to p), min-p (restrict to tokens above min-p × max), repetition penalty (down-weight recently-used tokens), DRY (penalize n-gram repetitions), XTC (exclude top choices when overconfident). Modern stacks combine min-p + DRY for chat, low-temp + greedy for code.
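A sketch of one decode step with temperature, min-p, and top-p applied to raw logits — the cutoff values are typical defaults, not tied to any particular stack, and repetition penalties are omitted:

```python
import numpy as np

def sample(logits, temperature=0.7, top_p=0.9, min_p=0.05, rng=None):
    rng = rng or np.random.default_rng()
    z = logits / temperature                             # temperature scales logits
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    probs[probs < min_p * probs.max()] = 0.0             # min-p: drop the unlikely tail
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cut = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cut]                                   # top-p: smallest nucleus >= p
    final = np.zeros_like(probs)
    final[keep] = probs[keep]
    final /= final.sum()
    return rng.choice(len(probs), p=final)               # sample the next token id

logits = np.random.default_rng(0).normal(size=32000)
print(sample(logits))
```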
Self-Attention
Architecture · Attention where queries, keys, and values all come from the same sequence.
Self-attention lets every position in a sequence attend to every other position. The fundamental building block of transformers. Compute and memory are both O(N²) in sequence length, which is why long-context attention requires optimizations like FlashAttention, sliding window, or linear attention.
SFT (Supervised Fine-Tuning)
Training · Standard supervised training: prompt → expected response pairs.
SFT trains the model on (prompt, response) pairs using cross-entropy loss on the response tokens. The first stage of post-training after pretraining. Typically uses 10K-1M curated instruction examples. Followed by preference learning (DPO/RLHF) to refine response quality.
SGLang
Frameworks · Production LLM serving engine — vLLM competitor with focus on structured output and agents.
SGLang (LMSys, 2024) is a high-throughput serving engine that adds RadixAttention prefix caching, structured output (JSON Schema, regex), agent-aware caching, and very tight integration with frontier models like DeepSeek and Qwen. Often slightly faster than vLLM on long-context and multi-turn workloads.
Sliding Window Attention
Architecture · Attention restricted to a local window of tokens — used in Mistral and Phi for efficiency.
Sliding window attention restricts each token to attending only to the W tokens before it (and itself), giving O(N×W) instead of O(N²) attention cost. Used in Mistral 7B (W=4096) and several Phi models. Combined with full attention layers for "global" routing, can scale to long context cheaply.
Speculative Decoding
Inference · Inference acceleration: a small draft model proposes K tokens, the large target verifies all in one pass.
Speculative decoding generates K candidate tokens with a small fast "draft" model, then verifies them in a single forward pass of the large target. Because LLM decode is bandwidth-bound, verifying K tokens at once is roughly the same wall-time as 1 token. With 70% acceptance rate at K=5, expect ~3× speedup with mathematically identical output. Variants: EAGLE, EAGLE-2, Medusa, prompt-lookup decoding.
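A simplified draft-and-verify loop with greedy acceptance (a sketch, not the full rejection-sampling scheme). `draft_next` and `target_next` are made-up stubs standing in for real models; in a real engine the verification loop is a single batched target forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100

def draft_next(ctx):
    return int((sum(ctx) * 7 + 3) % VOCAB)          # cheap stand-in draft model

def target_next(ctx):
    t = draft_next(ctx)                              # stub target: disagrees ~30% of the time
    return t if rng.random() < 0.7 else (t + 1) % VOCAB

def speculative_step(context, k=5):
    draft, ctx = [], list(context)
    for _ in range(k):                               # draft proposes k tokens one by one
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)                # target verifies the drafted prefix
    for d in draft:
        t = target_next(ctx)
        if t == d:
            accepted.append(d)                       # draft confirmed, keep going
            ctx.append(d)
        else:
            accepted.append(t)                       # first mismatch: take the target's token
            break
    return accepted

print(speculative_step([1, 2, 3]))                   # several tokens per target pass
```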
SWE-Bench
Benchmarks · Real-world coding benchmark — fix actual GitHub bugs across 12 popular Python repos.
SWE-Bench contains 2,294 real GitHub issues from 12 popular Python repositories (Django, SymPy, Flask, etc.); SWE-Bench Verified is the 500-issue, human-validated subset most leaderboards report. The model gets the issue and the codebase and must produce a patch that passes the project's own test suite. The single most important coding benchmark in 2026; correlates strongly with day-to-day developer productivity.
Temperature
Inference · Sampling parameter that scales logits before softmax — higher = more random output.
Temperature divides the logits before softmax: T < 1 sharpens the distribution (more deterministic), T > 1 flattens it (more diverse). T = 0 is greedy (always pick the argmax). Defaults: 0.0-0.3 for code, 0.7 for chat, 1.0+ for creative writing. Modern samplers like min-p make temperature less critical at the high end.
Tokenizer
Training · Algorithm that splits text into integer token IDs the model can consume.
Tokenizers convert raw text into token sequences. Modern LLMs use Byte-Pair Encoding (BPE) or SentencePiece variants. A tokenizer's vocabulary size (50K-256K typical) and language coverage materially affect downstream performance and inference cost. Tiktoken (OpenAI), Llama tokenizer (Meta), Tekken (Mistral Nemo), and Tiktoken-derivative tokenizers dominate 2024-2026.
Top-p (Nucleus Sampling)
Inference · Sampling that restricts tokens to the smallest set whose cumulative probability ≥ p.
Top-p (also called nucleus sampling, Holtzman et al., 2019) sorts tokens by probability and keeps the smallest prefix that sums to at least p. p=0.9 typically removes the long tail of low-probability tokens while preserving diversity. Common defaults: p=0.9 chat, p=0.95 creative, p=0.7-0.8 code.
Transformer
Architecture · Neural network architecture using self-attention as its core operation, introduced in 2017.
The Transformer architecture (Vaswani et al., 2017) replaced recurrent networks with parallel self-attention. Every modern frontier LLM uses some Transformer variant. Key components: self-attention layers, feed-forward layers, residual connections, layer normalization, positional encoding (modern versions use RoPE).
Unsloth
Frameworks · High-throughput LoRA / QLoRA fine-tuning library — 2-5x faster than baseline Hugging Face PEFT.
Unsloth (Daniel Han, 2023+) is a fine-tuning library with hand-optimized Triton kernels for LoRA/QLoRA. 2-5× faster training and 50% less VRAM than Hugging Face PEFT defaults. Free tier supports single GPU; paid tier supports multi-GPU. Standard for hobbyist and SMB QLoRA work in 2024-2026.
vLLM
Frameworks · High-throughput LLM serving engine using PagedAttention and continuous batching.
vLLM (UC Berkeley, 2023) is the de facto open-source production serving engine. PagedAttention KV management, continuous batching, prefix caching, FP8/AWQ/GPTQ support, OpenAI-compatible API, and tensor/pipeline parallelism. The default for any deployment serving > 1 concurrent user.
VLM (Vision-Language Model)
Models · A model that processes both text and images jointly.
Vision-Language Models combine a vision encoder (often based on ViT or CLIP) with a language model trained to reason about images alongside text. Top 2026 examples: Claude Opus 4.7, Gemini 3.1 Pro, GPT-5.5, Qwen 2-VL, Llama 3.2 Vision, GLM-4.5V. Use cases: OCR, chart understanding, multimodal search, agentic GUI control, video understanding.
VRAM
Hardware · Video RAM — the dedicated memory on a GPU where model weights, activations, and KV cache live.
VRAM (Video RAM) is the high-bandwidth memory on a GPU. For LLMs, total VRAM determines what model size you can run; VRAM bandwidth (measured in GB/s) determines how fast you can run it. RTX 4090: 24GB GDDR6X @ 1008 GB/s. H100: 80GB HBM3 @ 3.35 TB/s. M3 Ultra: up to 512GB unified memory @ 819 GB/s.
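A back-of-the-envelope weight-size estimate (a sketch; real usage adds activations, KV cache, and runtime overhead on top, and the bits-per-weight figures are typical averages, not exact):

```python
def weight_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, params, bits in [("7B @ BF16", 7, 16),
                           ("70B @ Q4_K_M", 70, 4.5),
                           ("32B @ FP8", 32, 8)]:
    print(f"{name}: ~{weight_gb(params, bits):.1f} GB + overhead")
```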
Go from reading about AI to building with AI
10 structured courses. Hands-on projects. Runs on your machine. Start free.
How we wrote this
Every entry was written from first-hand work — running, fine-tuning, and benchmarking these models and tools on real hardware. We avoid jargon where plain English works, and we avoid plain English when it obscures something a reader needs to know precisely. The goal: an entry should leave you able to follow a technical conversation about that term, not just identify it.
If a definition is wrong, outdated, or unclear, email us. The glossary is updated continuously — every major model release adds or refines entries within a week.
From definitions to building
Knowing the term isn't the same as shipping the system.
The 17-course AI Learning Path takes you from these terms to building real systems — RAG, agents, fine-tuning, multimodal, MLOps — all running on your hardware. First chapter of every course is free.
Related tools
AI Model Leaderboard →
Top 30 frontier and open models ranked by SWE-Bench, MMLU, ARC-AGI.
AI Model Finder →
Pick GPU + use case → recommended model. Free, instant, no signup.
VRAM Calculator →
Exact VRAM needs for any model + quantization combo.
Best AI Models 2026 →
Pillar comparison of the 10 most capable AI models of 2026.
Written by Pattanaik Ramswarup
Creator of Local AI Master
I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.