DPO, ORPO, KTO: Preference Fine-Tuning for Local LLMs (2026)
DPO replaced RLHF as the preferred alignment method for open-weight LLMs in 2024. ORPO, introduced the same year, unified SFT and preference fine-tuning into a single stage. KTO handles unpaired feedback signals. By 2026, all three are standard tools in any serious local fine-tuning pipeline. They share a common goal (align the model to your preferences) but differ in mathematical formulation and dataset assumptions.
This guide covers everything: the math behind each method (skipping RLHF's reward-model + PPO complexity), dataset formats, when to use which, integration with QLoRA via TRL and Axolotl, hyperparameter tuning, common failure modes, and concrete recipes for aligning Llama 3.1 8B to a specific brand voice or domain style.
Table of Contents
- Why Preference Fine-Tuning Matters
- The RLHF Pipeline (and Why It Was Replaced)
- DPO: Direct Preference Optimization
- ORPO: Unified SFT + Preference
- KTO: Unpaired Preference Signals
- Choosing DPO vs ORPO vs KTO
- Dataset Format and Sources
- TRL Integration
- Axolotl Integration
- Hyperparameter Tuning
- QLoRA + DPO/ORPO/KTO
- Evaluating Preference-Tuned Models
- Alignment Tax and Mitigations
- Common Failures
Why Preference Fine-Tuning Matters {#why}
Standard SFT (supervised fine-tuning) teaches a model to imitate target outputs. But often you have multiple acceptable responses with quality differences — and you want the model to learn which is better, not just which appeared in your training set.
Preference fine-tuning operationalizes "better than": train the model to assign higher probability to preferred responses than rejected ones.
Use cases:
- Style adaptation (brand voice, formality, humor)
- Helpfulness vs harmlessness trade-offs
- Refusal behavior tuning
- Domain-specific quality (medical accuracy, legal citation style)
- A/B test winning patterns
The RLHF Pipeline (and Why It Was Replaced) {#rlhf}
Classical RLHF (used by InstructGPT, ChatGPT, early Llama 2 chat):
Step 1: SFT on demonstrations
Step 2: Train a reward model on preference pairs
Step 3: Use PPO reinforcement learning to optimize policy against reward model
Problems: PPO is unstable, requires careful hyperparameter tuning, ~100K GPU-hours for serious runs, prone to reward hacking.
DPO showed that the reward model and PPO step can be eliminated — directly optimize the policy on preferences via a closed-form loss. Most open-weight alignment in 2024-2026 is DPO/ORPO/KTO; RLHF is largely retired.
DPO: Direct Preference Optimization {#dpo}
The DPO loss for a preference pair (prompt x, chosen y_w, rejected y_l):
L_DPO = -log σ(β · log π(y_w|x)/π_ref(y_w|x) - β · log π(y_l|x)/π_ref(y_l|x))
Where π is the model being trained, π_ref is a frozen reference (typically the SFT checkpoint), and β controls deviation magnitude.
Intuition: increase the log-probability of chosen responses relative to rejected, weighted against the reference model. β=0.1 is the typical default.
Pipeline:
- SFT on demonstrations (gives π_ref)
- DPO on preference pairs (yields aligned π)
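For reference, here is a minimal PyTorch sketch of the DPO loss above, assuming you have already computed summed per-sequence log-probabilities for the chosen and rejected responses under both the policy and the frozen reference (function and variable names are illustrative, not TRL's internals):
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument: tensor of summed log-probs per example, shape (batch,)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref for y_l
    # -log sigmoid(beta * (chosen margin - rejected margin)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()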
ORPO: Unified SFT + Preference {#orpo}
ORPO (Hong et al., March 2024) unifies SFT and DPO into a single loss:
L_ORPO = L_SFT(y_w) + λ · L_OR(y_w, y_l)
Where L_SFT is standard SFT loss on chosen responses, and L_OR is an odds-ratio loss penalizing rejected responses. λ controls the strength of preference signal.
Result: skip the SFT-then-DPO two-stage pipeline. Train once, get both. Quality is comparable or slightly better than SFT+DPO at half the compute.
Use ORPO when starting from a base (non-SFT) model. Use DPO when you already have an SFT checkpoint.
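A minimal PyTorch sketch of the ORPO loss, assuming mean per-token log-probabilities for chosen and rejected responses plus the standard SFT negative log-likelihood on the chosen responses (names are illustrative, not the reference implementation):
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, chosen_nll, lam=0.1):
    # chosen_logps / rejected_logps: mean per-token log-probs, shape (batch,)
    # chosen_nll: standard SFT loss on the chosen responses
    # log-odds = log(p / (1 - p)), computed from log p in a numerically stable way
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    odds_ratio_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return (chosen_nll + lam * odds_ratio_loss).mean()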
KTO: Unpaired Preference Signals {#kto}
KTO (Ethayarajh et al., 2024) handles binary feedback without paired comparisons:
L_KTO = E[1 - σ(w_y · β · (log π(y|x)/π_ref(y|x) - z_ref))]
Where w_y is +1 for "good" responses, -1 for "bad", and z_ref is a reference-point baseline (an estimate of the KL between π and π_ref). The loss is derived from Kahneman-Tversky prospect theory (loss-aversion modeling); the full paper additionally weights good and bad examples separately.
Useful when:
- You have user thumbs-up/thumbs-down without paired comparisons
- A/B test outcomes (winner / loser of single response shown)
- Automated quality classifier scores
- Larger datasets of unpaired feedback than paired
Quality roughly matches DPO when paired data is available (each pair can be split into one good and one bad example); when only unpaired feedback exists, KTO is the only one of the three that applies.
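A simplified PyTorch sketch of the loss above, assuming equal weighting of good and bad examples and a precomputed z_ref baseline (names are illustrative):
import torch

def kto_loss(policy_logps, ref_logps, labels, z_ref, beta=0.1):
    # policy_logps / ref_logps: summed log-probs per example, shape (batch,)
    # labels: +1 for "good" responses, -1 for "bad"
    # z_ref: scalar reference point (estimated KL between policy and reference)
    reward = policy_logps - ref_logps   # log pi/pi_ref
    return (1 - torch.sigmoid(labels * beta * (reward - z_ref))).mean()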
Choosing DPO vs ORPO vs KTO {#choosing}
| Scenario | Method |
|---|---|
| Existing SFT checkpoint + paired preferences | DPO |
| Base model + paired preferences (no SFT done) | ORPO |
| Unpaired feedback (thumbs up/down, A/B) | KTO |
| Very small preference dataset (<500 pairs) | DPO with high β |
| Large unpaired dataset (>10K examples) | KTO |
| Want simplest pipeline | ORPO |
Dataset Format and Sources {#dataset}
DPO / ORPO format:
{
"prompt": "What is local AI?",
"chosen": "Local AI is...",
"rejected": "Local AI is some thing..."
}
KTO format:
{"prompt": "What is local AI?", "response": "Local AI is...", "label": true}
{"prompt": "What is local AI?", "response": "Local AI is some thing...", "label": false}
Sources:
- HH-RLHF (Anthropic) — helpfulness/harmlessness pairs
- UltraFeedback — broad preference pairs
- OpenAssistant — community-rated responses
- Synthetic AI judge: generate 2 responses, ask GPT-4o/Claude which is better
- Production logs with thumbs-up/down (KTO)
- Manual annotation for brand voice / specific tone
For most domain adaptation, 1K-5K AI-judge synthetic pairs are the cheapest practical route.
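One way to build such pairs is to sample two responses from your current model and let a stronger model pick the winner. A rough sketch using the OpenAI client (the judge prompt wording and model name are placeholders for your own setup):
import json
from openai import OpenAI

client = OpenAI()

def judge_pair(prompt, response_a, response_b):
    # Ask a stronger model which response is better; expect "A" or "B" back
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Prompt:\n{prompt}\n\nResponse A:\n{response_a}\n\n"
                       f"Response B:\n{response_b}\n\nWhich response is better? Answer with A or B only.",
        }],
    )
    return verdict.choices[0].message.content.strip()

def build_pair(prompt, response_a, response_b):
    winner = judge_pair(prompt, response_a, response_b)
    chosen, rejected = (response_a, response_b) if winner.startswith("A") else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}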
TRL Integration {#trl}
from trl import DPOTrainer, DPOConfig, ORPOTrainer, KTOTrainer
from datasets import load_dataset

# model, ref_model (frozen SFT checkpoint) and tokenizer are assumed to be loaded already
preference_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

# DPO
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,  # frozen SFT checkpoint
    args=DPOConfig(
        beta=0.1,
        learning_rate=5e-7,
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        output_dir="./dpo_output",
    ),
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()
ORPO and KTO have similar APIs — see TRL docs.
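For example, an ORPO run looks roughly like this (ORPO needs no reference model; dataset and tokenizer as above, and the exact argument names may differ across TRL versions):
from trl import ORPOTrainer, ORPOConfig

trainer = ORPOTrainer(
    model=model,                       # base model, no SFT checkpoint required
    args=ORPOConfig(
        learning_rate=8e-6,
        beta=0.1,                      # corresponds to the paper's lambda in TRL's naming
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        output_dir="./orpo_output",
    ),
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)
trainer.train()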
Axolotl Integration {#axolotl}
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
adapter: qlora
load_in_4bit: true
lora_r: 32
lora_alpha: 64

rl: dpo  # or orpo, kto
dpo_beta: 0.1

datasets:
  - path: ./preferences.jsonl
    type: chatml.intel
    split: train

learning_rate: 5e-7
num_epochs: 1
micro_batch_size: 2
gradient_accumulation_steps: 4
Launch the run with:
accelerate launch -m axolotl.cli.train dpo.yml
Hyperparameter Tuning {#hyperparameters}
| Parameter | DPO | ORPO | KTO |
|---|---|---|---|
| Learning rate | 5e-7 to 1e-6 | 8e-6 to 5e-5 | 5e-7 to 1e-6 |
| β | 0.1 default; 0.01-0.5 range | n/a | 0.1 default |
| λ (ORPO) | n/a | 0.1 default | n/a |
| Epochs | 1-2 | 1-3 | 1-2 |
Use lower learning rates than for SFT. Preference fine-tuning is sensitive: too high a learning rate causes catastrophic forgetting.
QLoRA + DPO/ORPO/KTO {#qlora}
Combining 4-bit quantized base with preference fine-tuning:
# Axolotl
adapter: qlora
load_in_4bit: true
lora_r: 32
lora_alpha: 64
rl: dpo
dpo_beta: 0.1
Memory: similar to QLoRA SFT. Time: 1.5-2x SFT (DPO needs reference model evaluation each step).
On RTX 4090, QLoRA-DPO of Llama 3.1 8B with 1K pairs: ~3-6 hours.
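A minimal TRL sketch of this setup, assuming a 4-bit base plus a LoRA adapter; with a PEFT adapter you can pass ref_model=None and TRL uses the adapter-disabled base model as the reference (check your TRL version's docs for exact arguments):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import DPOTrainer, DPOConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = DPOTrainer(
    model=model,
    ref_model=None,   # adapter disabled at reference time, so no second model in memory
    args=DPOConfig(beta=0.1, learning_rate=5e-7, num_train_epochs=1, output_dir="./qlora_dpo"),
    train_dataset=preference_dataset,   # same pair format as above
    tokenizer=tokenizer,
    peft_config=LoraConfig(r=32, lora_alpha=64, task_type="CAUSAL_LM"),
)
trainer.train()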
Evaluating Preference-Tuned Models {#evaluation}
Three evaluation axes:
- Held-out preference accuracy: does the model prefer chosen over rejected on held-out pairs?
- General benchmark regression: MMLU / HumanEval / GSM8K shouldn't drop more than 1-3%
- Qualitative: spot-check 50-100 outputs by hand
For brand voice / style adaptation: human evaluation matters more than benchmarks. Have target users rate 30-50 outputs blind vs the SFT-only model.
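For the first axis, a rough sketch of held-out preference accuracy: score each chosen and rejected response under the tuned model and count how often the chosen one gets higher log-probability (the prompt/response token boundary handling here is approximate):
import torch

@torch.no_grad()
def response_logprob(model, tokenizer, prompt, response):
    # Sum of log-probs the model assigns to the response tokens, given the prompt
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[:, :-1]
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits.float(), dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logps[:, prompt_len - 1:].sum().item()   # keep only response positions

def preference_accuracy(model, tokenizer, held_out_pairs):
    wins = sum(
        response_logprob(model, tokenizer, p["prompt"], p["chosen"])
        > response_logprob(model, tokenizer, p["prompt"], p["rejected"])
        for p in held_out_pairs
    )
    return wins / len(held_out_pairs)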
Alignment Tax and Mitigations {#alignment-tax}
Preference fine-tuning often degrades unrelated capabilities — the "alignment tax." Mitigations:
- Higher β (0.3-0.5) — limits deviation from reference
- Fewer epochs (1-2 max)
- Mix in SFT data — 30-50% general SFT examples in the training mix
- Smaller LoRA rank — limits capacity for over-specialization
- Regular evaluation on out-of-domain benchmarks during training
Common Failures {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Catastrophic forgetting | LR too high | Drop to 5e-7 |
| No preference learning | β too high or LR too low | Lower β to 0.05 or raise LR |
| Model becomes verbose / over-confident | Over-training | Fewer epochs, smaller dataset |
| Rejected responses still preferred | Bad data | Re-curate preference dataset |
| OOM with reference model | Two models in memory | With a PEFT adapter, pass ref_model=None so the adapter-disabled base model serves as the reference |
Sources: DPO paper (Rafailov et al., 2023) | ORPO paper (Hong et al., 2024) | KTO paper (Ethayarajh et al., 2024) | HuggingFace TRL | Axolotl.