DPO, ORPO, KTO: Preference Fine-Tuning for Local LLMs (2026)
DPO replaced RLHF as the preferred alignment method for open-weight LLMs in 2024. ORPO, introduced the same year, unified SFT and preference fine-tuning into a single stage. KTO handles unpaired feedback signals. By 2026, all three are standard tools in any serious local fine-tuning pipeline. They share a common goal (align the model to your preferences) but differ in mathematical formulation and dataset assumptions.
This guide covers everything: the math behind each method (skipping RLHF's reward-model + PPO complexity), dataset formats, when to use which, integration with QLoRA via TRL and Axolotl, hyperparameter tuning, common failure modes, and concrete recipes for aligning Llama 3.1 8B to a specific brand voice or domain style.
Table of Contents
- Why Preference Fine-Tuning Matters
- The RLHF Pipeline (and Why It Was Replaced)
- DPO: Direct Preference Optimization
- ORPO: Unified SFT + Preference
- KTO: Unpaired Preference Signals
- Choosing DPO vs ORPO vs KTO
- Dataset Format and Sources
- TRL Integration
- Axolotl Integration
- Hyperparameter Tuning
- QLoRA + DPO/ORPO/KTO
- Evaluating Preference-Tuned Models
- Alignment Tax and Mitigations
- Common Failures
Why Preference Fine-Tuning Matters {#why}
Standard SFT (supervised fine-tuning) teaches a model to imitate target outputs. But often you have multiple acceptable responses with quality differences — and you want the model to learn which is better, not just which appeared in your training set.
Preference fine-tuning operationalizes "better than": train the model to assign higher probability to preferred responses than rejected ones.
Use cases:
- Style adaptation (brand voice, formality, humor)
- Helpfulness vs harmlessness trade-offs
- Refusal behavior tuning
- Domain-specific quality (medical accuracy, legal citation style)
- A/B test winning patterns
The RLHF Pipeline (and Why It Was Replaced) {#rlhf}
Classical RLHF (used by InstructGPT, ChatGPT, early Llama 2 chat):
Step 1: SFT on demonstrations
Step 2: Train a reward model on preference pairs
Step 3: Use PPO reinforcement learning to optimize policy against reward model
Problems: PPO is unstable, requires careful hyperparameter tuning, ~100K GPU-hours for serious runs, prone to reward hacking.
DPO showed that the reward model and PPO step can be eliminated — directly optimize the policy on preferences via a closed-form loss. Most open-weight alignment in 2024-2026 is DPO/ORPO/KTO; RLHF is largely retired.
DPO: Direct Preference Optimization {#dpo}
The DPO loss for a preference pair (prompt x, chosen y_w, rejected y_l):
L_DPO = -log σ(β · log π(y_w|x)/π_ref(y_w|x) - β · log π(y_l|x)/π_ref(y_l|x))
Where π is the model being trained, π_ref is a frozen reference (typically the SFT checkpoint), and β controls deviation magnitude.
Intuition: increase the log-probability of chosen responses relative to rejected, weighted against the reference model. β=0.1 is the typical default.
Pipeline:
- SFT on demonstrations (gives π_ref)
- DPO on preference pairs (yields aligned π)
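For reference, here is a minimal PyTorch sketch of the DPO loss above, assuming you have already computed summed per-sequence log-probabilities for the chosen and rejected responses under both the policy and the frozen reference (function and variable names are illustrative, not TRL's internals):
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument: tensor of summed log-probs per example, shape (batch,)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref for y_l
    # -log sigmoid(beta * (chosen margin - rejected margin)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()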
ORPO: Unified SFT + Preference {#orpo}
ORPO (Hong et al., March 2024) unifies SFT and DPO into a single loss:
L_ORPO = L_SFT(y_w) + λ · L_OR(y_w, y_l)
Where L_SFT is standard SFT loss on chosen responses, and L_OR is an odds-ratio loss penalizing rejected responses. λ controls the strength of preference signal.
Result: skip the SFT-then-DPO two-stage pipeline. Train once, get both. Quality is comparable or slightly better than SFT+DPO at half the compute.
Use ORPO when starting from a base (non-SFT) model. Use DPO when you already have an SFT checkpoint.
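A minimal PyTorch sketch of the ORPO loss, assuming mean per-token log-probabilities for chosen and rejected responses plus the standard SFT negative log-likelihood on the chosen responses (names are illustrative, not the reference implementation):
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, chosen_nll, lam=0.1):
    # chosen_logps / rejected_logps: mean per-token log-probs, shape (batch,)
    # chosen_nll: standard SFT loss on the chosen responses
    # log-odds = log(p / (1 - p)), computed from log p in a numerically stable way
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    odds_ratio_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return (chosen_nll + lam * odds_ratio_loss).mean()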
KTO: Unpaired Preference Signals {#kto}
KTO (Ethayarajh et al., 2024) handles binary feedback without paired comparisons:
L_KTO = E[1 - σ(w_y · β · (log π(y|x)/π_ref(y|x) - z_ref))]
Where w_y is +1 for "good" responses, -1 for "bad", and z_ref is a reference-point baseline (an estimate of the KL between π and π_ref). The loss is derived from Kahneman-Tversky prospect theory (loss-aversion modeling); the full paper additionally weights good and bad examples separately.
Useful when:
- You have user thumbs-up/thumbs-down without paired comparisons
- A/B test outcomes (winner / loser of single response shown)
- Automated quality classifier scores
- Larger datasets of unpaired feedback than paired
Quality roughly matches DPO when paired data is available (each pair can be split into one good and one bad example); when only unpaired feedback exists, KTO is the only one of the three that applies.
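A simplified PyTorch sketch of the loss above, assuming equal weighting of good and bad examples and a precomputed z_ref baseline (names are illustrative):
import torch

def kto_loss(policy_logps, ref_logps, labels, z_ref, beta=0.1):
    # policy_logps / ref_logps: summed log-probs per example, shape (batch,)
    # labels: +1 for "good" responses, -1 for "bad"
    # z_ref: scalar reference point (estimated KL between policy and reference)
    reward = policy_logps - ref_logps   # log pi/pi_ref
    return (1 - torch.sigmoid(labels * beta * (reward - z_ref))).mean()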
Choosing DPO vs ORPO vs KTO {#choosing}
| Scenario | Method |
|---|---|
| Existing SFT checkpoint + paired preferences | DPO |
| Base model + paired preferences (no SFT done) | ORPO |
| Unpaired feedback (thumbs up/down, A/B) | KTO |
| Very small preference dataset (<500 pairs) | DPO with high β |
| Large unpaired dataset (>10K examples) | KTO |
| Want simplest pipeline | ORPO |
Dataset Format and Sources {#dataset}
DPO / ORPO format:
{
"prompt": "What is local AI?",
"chosen": "Local AI is...",
"rejected": "Local AI is some thing..."
}
KTO format:
{"prompt": "What is local AI?", "response": "Local AI is...", "label": true}
{"prompt": "What is local AI?", "response": "Local AI is some thing...", "label": false}
Sources:
- HH-RLHF (Anthropic) — helpfulness/harmlessness pairs
- UltraFeedback — broad preference pairs
- OpenAssistant — community-rated responses
- Synthetic AI judge: generate 2 responses, ask GPT-4o/Claude which is better
- Production logs with thumbs-up/down (KTO)
- Manual annotation for brand voice / specific tone
For most domain adaptation, 1K-5K AI-judge synthetic pairs are the cheapest practical route.
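One way to build such pairs is to sample two responses from your current model and let a stronger model pick the winner. A rough sketch using the OpenAI client (the judge prompt wording and model name are placeholders for your own setup):
import json
from openai import OpenAI

client = OpenAI()

def judge_pair(prompt, response_a, response_b):
    # Ask a stronger model which response is better; expect "A" or "B" back
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Prompt:\n{prompt}\n\nResponse A:\n{response_a}\n\n"
                       f"Response B:\n{response_b}\n\nWhich response is better? Answer with A or B only.",
        }],
    )
    return verdict.choices[0].message.content.strip()

def build_pair(prompt, response_a, response_b):
    winner = judge_pair(prompt, response_a, response_b)
    chosen, rejected = (response_a, response_b) if winner.startswith("A") else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}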
TRL Integration {#trl}
from trl import DPOTrainer, DPOConfig, ORPOTrainer, KTOTrainer
from datasets import load_dataset

# model, ref_model (frozen SFT checkpoint) and tokenizer are assumed to be loaded already
preference_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

# DPO
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # frozen SFT checkpoint
    args=DPOConfig(
        beta=0.1,
        learning_rate=5e-7,
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        output_dir="./dpo_output",
    ),
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()
ORPO and KTO have similar APIs — see TRL docs.
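For example, an ORPO run looks roughly like this (ORPO needs no reference model; dataset and tokenizer as above, and the exact argument names may differ across TRL versions):
from trl import ORPOTrainer, ORPOConfig

trainer = ORPOTrainer(
    model=model,                       # base model, no SFT checkpoint required
    args=ORPOConfig(
        learning_rate=8e-6,
        beta=0.1,                      # corresponds to the paper's lambda in TRL's naming
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        output_dir="./orpo_output",
    ),
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()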
Axolotl Integration {#axolotl}
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
adapter: qlora
load_in_4bit: true
lora_r: 32
lora_alpha: 64

rl: dpo  # or orpo, kto
dpo_beta: 0.1

datasets:
  - path: ./preferences.jsonl
    type: chatml.intel
    split: train

learning_rate: 5e-7
num_epochs: 1
micro_batch_size: 2
gradient_accumulation_steps: 4
Launch the run with:
accelerate launch -m axolotl.cli.train dpo.yml
Hyperparameter Tuning {#hyperparameters}
| Parameter | DPO | ORPO | KTO |
|---|---|---|---|
| Learning rate | 5e-7 to 1e-6 | 8e-6 to 5e-5 | 5e-7 to 1e-6 |
| β | 0.1 default; 0.01-0.5 range | n/a | 0.1 default |
| λ (ORPO) | n/a | 0.1 default | n/a |
| Epochs | 1-2 | 1-3 | 1-2 |
Use lower learning rates than for SFT. Preference fine-tuning is sensitive: too high a learning rate causes catastrophic forgetting.
QLoRA + DPO/ORPO/KTO {#qlora}
Combining 4-bit quantized base with preference fine-tuning:
# Axolotl
adapter: qlora
load_in_4bit: true
lora_r: 32
lora_alpha: 64
rl: dpo
dpo_beta: 0.1
Memory: similar to QLoRA SFT. Time: 1.5-2x SFT (DPO needs reference model evaluation each step).
On RTX 4090, QLoRA-DPO of Llama 3.1 8B with 1K pairs: ~3-6 hours.
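A minimal TRL sketch of this setup, assuming a 4-bit base plus a LoRA adapter; with a PEFT adapter you can pass ref_model=None and TRL uses the adapter-disabled base model as the reference (check your TRL version's docs for exact arguments):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import DPOTrainer, DPOConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = DPOTrainer(
    model=model,
    ref_model=None,   # adapter disabled at reference time, so no second model in memory
    args=DPOConfig(beta=0.1, learning_rate=5e-7, num_train_epochs=1, output_dir="./qlora_dpo"),
    train_dataset=preference_dataset,   # same pair format as above
    tokenizer=tokenizer,
    peft_config=LoraConfig(r=32, lora_alpha=64, task_type="CAUSAL_LM"),
)
trainer.train()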
Evaluating Preference-Tuned Models {#evaluation}
Three evaluation axes:
- Held-out preference accuracy: does the model prefer chosen over rejected on held-out pairs?
- General benchmark regression: MMLU / HumanEval / GSM8K shouldn't drop more than 1-3%
- Qualitative: spot-check 50-100 outputs by hand
For brand voice / style adaptation: human evaluation matters more than benchmarks. Have target users rate 30-50 outputs blind vs the SFT-only model.
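For the first axis, a rough sketch of held-out preference accuracy: score each chosen and rejected response under the tuned model and count how often the chosen one gets higher log-probability (the prompt/response token boundary handling here is approximate):
import torch

@torch.no_grad()
def response_logprob(model, tokenizer, prompt, response):
    # Sum of log-probs the model assigns to the response tokens, given the prompt
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[:, :-1]
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits.float(), dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logps[:, prompt_len - 1:].sum().item()   # keep only response positions

def preference_accuracy(model, tokenizer, held_out_pairs):
    wins = sum(
        response_logprob(model, tokenizer, p["prompt"], p["chosen"])
        > response_logprob(model, tokenizer, p["prompt"], p["rejected"])
        for p in held_out_pairs
    )
    return wins / len(held_out_pairs)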
Alignment Tax and Mitigations {#alignment-tax}
Preference fine-tuning often degrades unrelated capabilities — the "alignment tax." Mitigations:
- Higher β (0.3-0.5) — limits deviation from reference
- Fewer epochs (1-2 max)
- Mix in SFT data — 30-50% general SFT examples in the training mix
- Smaller LoRA rank — limits capacity for over-specialization
- Regular evaluation on out-of-domain benchmarks during training
Common Failures {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Catastrophic forgetting | LR too high | Drop to 5e-7 |
| No preference learning | β too high or LR too low | Lower β to 0.05 or raise LR |
| Model becomes verbose / over-confident | Over-training | Fewer epochs, smaller dataset |
| Rejected responses still preferred | Bad data | Re-curate preference dataset |
| OOM with reference model | Two models in memory | With a PEFT adapter, pass ref_model=None so the adapter-disabled base model serves as the reference |
Sources: DPO paper (Rafailov et al., 2023) | ORPO paper (Hong et al., 2024) | KTO paper (Ethayarajh et al., 2024) | HuggingFace TRL | Axolotl.