Fine-Tune AI Models with Your Own Data Locally
Published on April 10, 2026 • 26 min read
The Short Version
Fine-tune a model on your data in four steps:
- Prepare data: Format as JSONL with instruction/input/output fields
- Install Unsloth: pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
- Train: Load base model with QLoRA, train for 100-500 steps
- Deploy: Export to GGUF, create Ollama model, serve locally
Total time: 30-90 minutes on a single consumer GPU. Cost: electricity only.
What this guide covers:
- When fine-tuning beats prompting and RAG (and when it doesn't)
- Data preparation: format, quality, minimum dataset size
- LoRA vs QLoRA vs full fine-tuning explained
- Complete Unsloth training pipeline with real code
- Evaluation methods that actually tell you if it worked
- Deploying your custom model through Ollama
- Common pitfalls that waste your time
Fine-tuning is the process of taking a pre-trained model and teaching it new behaviors using your own data. You're not training from scratch -- you're adjusting an existing model's knowledge to match your specific needs. A customer support bot that knows your product catalog. A code assistant that follows your team's conventions. A medical summarizer that uses your institution's terminology.
The barrier to entry dropped dramatically over the past year. QLoRA (Quantized Low-Rank Adaptation) lets you fine-tune a 7B parameter model on a GPU with just 4-6GB of VRAM. Unsloth makes the training process 2x faster than standard HuggingFace implementations. You no longer need cloud compute or expensive hardware.
For the LoRA fundamentals this guide builds on, see our LoRA fine-tuning local guide. If you're not sure whether fine-tuning is right for your use case, start with the decision framework in the next section.
Table of Contents
- Fine-Tuning vs Prompting vs RAG
- Data Preparation
- LoRA vs QLoRA vs Full Fine-Tuning
- Hardware Requirements
- Unsloth Setup
- Step-by-Step Training
- Evaluating Your Model
- Deploying with Ollama
- Cost Analysis
- Common Pitfalls
Fine-Tuning vs Prompting vs RAG {#when-to-fine-tune}
Three approaches exist to customize AI behavior. Choosing wrong wastes weeks.
Decision Framework
| Approach | Best When | Dataset Size | Effort | Result |
|---|---|---|---|---|
| Prompting | You need specific output format | 0 examples | 10 minutes | Good enough for many tasks |
| RAG | You need current, factual answers from documents | Any doc count | 1-2 hours setup | Great for knowledge bases |
| Fine-tuning | You need changed model behavior/style | 100-10,000 examples | 2-8 hours | Permanent behavior change |
When Fine-Tuning Is the Right Choice
Fine-tune when you need:
- Consistent output format that prompting can't enforce (e.g., always respond in specific JSON schema)
- Domain-specific language the base model doesn't understand (legal, medical, proprietary jargon)
- A specific tone or personality that system prompts can't reliably maintain
- Reduced inference cost (fine-tuned models need shorter prompts)
- Tasks where the model consistently fails despite good prompting
When Fine-Tuning Is the Wrong Choice
Don't fine-tune when:
- Your data changes frequently (use RAG instead -- it handles updates without retraining)
- You need factual accuracy from a knowledge base (RAG retrieves facts; fine-tuning memorizes patterns)
- A well-crafted system prompt already gets 90%+ accuracy (engineering the prompt is faster)
- You have fewer than 50 high-quality examples (the model won't learn meaningful patterns)
- You want to add "knowledge" to the model (that's what RAG is for)
The most common mistake: people fine-tune to add knowledge when they should use RAG, or they build RAG pipelines when they need style/format changes that fine-tuning handles trivially.
Data Preparation {#data-preparation}
Data quality is everything. A small dataset of 200 excellent examples will outperform 5,000 sloppy ones. Period.
Data Format
The standard format is JSONL (JSON Lines) with instruction-input-output triples:
{"instruction": "Summarize this customer complaint", "input": "I ordered product X three weeks ago and it still hasn't arrived. I've called twice and nobody can tell me where it is. This is unacceptable.", "output": "Issue: Delayed delivery (3+ weeks). Product: X. Previous contact: 2 calls, unresolved. Sentiment: Frustrated. Priority: High."}
{"instruction": "Summarize this customer complaint", "input": "The software crashes every time I try to export to PDF. I'm running version 4.2 on Windows 11.", "output": "Issue: PDF export crash (reproducible). Software: v4.2. OS: Windows 11. Sentiment: Neutral. Priority: Medium."}
Alternative Format: Chat/Conversation
{"conversations": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Patient presents with acute lower back pain radiating to left leg, worsened by sitting."}, {"role": "assistant", "content": "Suggested ICD-10 codes:\n- M54.5: Low back pain\n- M54.41: Sciatica, left side\n\nPrimary: M54.41 (radiculopathy pattern suggests sciatica as primary)"}]}
Data Quality Checklist
- Consistent format -- Every example follows the same structure
- Correct outputs -- Human-verified, not AI-generated without review
- Representative -- Cover the full range of inputs your model will encounter
- Diverse -- Avoid too many similar examples (model will overfit)
- Clean -- No typos in outputs, no contradictory examples, no truncated text
Minimum Dataset Size
| Task Complexity | Minimum Examples | Recommended | Steps |
|---|---|---|---|
| Simple format change | 50 | 200 | 100 |
| Domain adaptation | 200 | 1,000 | 300 |
| Complex behavior | 500 | 2,000-5,000 | 500-1,000 |
| Multi-task | 1,000+ | 5,000-10,000 | 1,000+ |
Data Preparation Script
import json
def validate_dataset(filepath):
"""Validate a JSONL dataset for fine-tuning."""
issues = []
examples = []
with open(filepath, 'r') as f:
for i, line in enumerate(f):
try:
data = json.loads(line)
examples.append(data)
# Check required fields
if 'instruction' not in data:
issues.append(f"Line {i+1}: Missing 'instruction' field")
if 'output' not in data:
issues.append(f"Line {i+1}: Missing 'output' field")
# Check for empty outputs
if data.get('output', '').strip() == '':
issues.append(f"Line {i+1}: Empty output")
# Check output length
if len(data.get('output', '')) < 10:
issues.append(f"Line {i+1}: Very short output ({len(data.get('output', ''))} chars)")
except json.JSONDecodeError:
issues.append(f"Line {i+1}: Invalid JSON")
print(f"Total examples: {len(examples)}")
print(f"Issues found: {len(issues)}")
for issue in issues[:20]:
print(f" - {issue}")
# Check for duplicates
outputs = [e.get('output', '') for e in examples]
dupes = len(outputs) - len(set(outputs))
if dupes:
print(f" - {dupes} duplicate outputs detected")
return len(issues) == 0
validate_dataset("my_training_data.jsonl")
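The duplicate check above only catches byte-identical outputs. Near-duplicates (the same output with trivial wording changes) hurt diversity just as much. A minimal sketch using the standard library's difflib; the 0.9 threshold is an assumption to tune for your data:

```python
from difflib import SequenceMatcher
from itertools import combinations

def find_near_duplicates(outputs, threshold=0.9):
    """Return (i, j, ratio) for output pairs more similar than threshold.

    All-pairs comparison is O(n^2) -- fine up to a few thousand examples;
    sample the dataset first if yours is larger.
    """
    pairs = []
    for i, j in combinations(range(len(outputs)), 2):
        ratio = SequenceMatcher(None, outputs[i], outputs[j]).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 3)))
    return pairs

outputs = [
    "Issue: Delayed delivery. Priority: High.",
    "Issue: Delayed delivery. Priority: High!",   # near-duplicate
    "Issue: PDF export crash. Priority: Medium.",
]
print(find_near_duplicates(outputs))  # [(0, 1, 0.975)]
```

If more than a few percent of pairs trip the threshold, prune before training rather than after.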
LoRA vs QLoRA vs Full Fine-Tuning {#training-methods}
Full Fine-Tuning
Updates every weight in the model. Produces the best results but requires enormous resources.
| Model Size | VRAM Needed | Training Time | When to Use |
|---|---|---|---|
| 1B | ~12GB | ~1 hour | Almost never locally |
| 7B | ~60GB | ~6 hours | Multi-GPU setups only |
| 13B | ~120GB | ~12 hours | Server/cloud only |
Verdict: Not practical on consumer hardware for models above 1B.
LoRA (Low-Rank Adaptation)
Freezes the base model and trains small adapter matrices that modify the model's behavior. Dramatically reduces memory and training time.
| Model Size | VRAM Needed | Adapter Size | Quality vs Full |
|---|---|---|---|
| 7B | ~14GB | 20-100MB | 95-98% |
| 13B | ~28GB | 30-150MB | 95-98% |
How LoRA works: Instead of updating a weight matrix W (e.g., 4096x4096 = 16M parameters), LoRA decomposes the update into two small matrices A and B where W' = W + AB. With rank r=16, that's only 4096*16 + 16*4096 = 131K parameters. You train 131K values instead of 16M.
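That parameter arithmetic is easy to verify in a few lines:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in a rank-r LoRA adapter for one d_in x d_out matrix."""
    # A is d_in x r, B is r x d_out; the update is W' = W + AB
    return d_in * r + r * d_out

full = 4096 * 4096                       # updating W directly: 16,777,216 params
adapter = lora_params(4096, 4096, r=16)  # 131,072 params
print(f"{adapter:,} vs {full:,} ({full // adapter}x fewer)")
# 131,072 vs 16,777,216 (128x fewer)
```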
QLoRA (Quantized LoRA)
Same concept as LoRA, but the base model is loaded in 4-bit quantization. This is the breakthrough that makes consumer-GPU fine-tuning possible.
| Model Size | VRAM Needed | Quality vs Full | Speed |
|---|---|---|---|
| 3B | ~4GB | 93-96% | Fast |
| 7B | ~6GB | 93-96% | Moderate |
| 13B | ~12GB | 93-96% | Slow |
| 32B | ~24GB | 93-96% | Very slow |
Verdict: QLoRA is the standard for local fine-tuning. The 3-5% quality gap versus full fine-tuning is rarely noticeable in practice.
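A back-of-envelope way to sanity-check the VRAM figures above: 4-bit weights cost about 0.5 bytes per parameter, and training adds activation and optimizer overhead on top. The ~1.5x overhead factor and ~1.5 GB fixed cost below are assumptions for rough planning, not measurements:

```python
def qlora_vram_gb(params_billion: float) -> float:
    """Rough QLoRA VRAM estimate: 4-bit weights (~0.5 bytes/param)
    plus ~50% training overhead plus ~1.5 GB fixed framework cost.
    Assumed factors -- calibrate against your own runs."""
    return params_billion * 0.5 * 1.5 + 1.5

for size in (3, 7, 13, 32):
    print(f"{size}B: ~{qlora_vram_gb(size):.1f} GB")
```

The estimates land within a couple of GB of the table; actual usage also depends on sequence length, batch size, and gradient checkpointing.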
Hardware Requirements {#hardware-requirements}
Minimum Specs for QLoRA Fine-Tuning
| Model Size | Min VRAM | Min RAM | Storage | Time (500 steps) |
|---|---|---|---|---|
| 3B | 4GB | 8GB | 10GB | 15-25 min |
| 7-8B | 6GB | 16GB | 20GB | 30-60 min |
| 13-14B | 10GB | 16GB | 40GB | 1-2 hours |
| 32B | 22GB | 32GB | 80GB | 3-6 hours |
| 70B | 44GB | 64GB | 160GB | 8-16 hours |
Tested GPU Performance
| GPU | VRAM | Max Model (QLoRA) | 7B Training Speed |
|---|---|---|---|
| RTX 3060 | 12GB | 13B | 3.2 samples/sec |
| RTX 4060 Ti | 8GB | 7B | 4.1 samples/sec |
| RTX 4070 | 12GB | 13B | 5.8 samples/sec |
| RTX 4090 | 24GB | 32B | 9.4 samples/sec |
| RTX 5090 | 32GB | 70B (tight) | 11.2 samples/sec |
Apple Silicon
Apple Silicon can fine-tune via MLX, but it's significantly slower than CUDA:
| Mac | RAM | Max Model | 7B Speed |
|---|---|---|---|
| M1 16GB | 16GB | 7B | 0.8 samples/sec |
| M3 Pro 18GB | 18GB | 7B | 1.4 samples/sec |
| M3 Max 36GB | 36GB | 14B | 1.8 samples/sec |
| M4 Pro 24GB | 24GB | 13B | 1.6 samples/sec |
An RTX 3060 trains 4x faster than an M3 Pro for the same model. NVIDIA is the clear choice for fine-tuning specifically. For inference-only workloads, Apple Silicon is more competitive. See our RAM requirements guide for inference-focused hardware sizing.
Unsloth Setup {#unsloth-setup}
Unsloth is the fastest open-source fine-tuning framework. It provides 2x training speed over standard HuggingFace, automatic memory optimization, and pre-quantized model loading.
Installation
# Create a clean environment
conda create -n finetune python=3.11
conda activate finetune
# Install PyTorch (CUDA)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install Unsloth
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
# Verify CUDA
python -c "import torch; print(torch.cuda.is_available())"
# Should print: True
Verify Setup
from unsloth import FastLanguageModel
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
Step-by-Step Training {#training-pipeline}
Here's a complete, working fine-tuning pipeline. This example fine-tunes Llama 3.2 3B on a custom instruction dataset.
Step 1: Load the Base Model
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
max_seq_length=2048,
dtype=None, # Auto-detect
load_in_4bit=True, # QLoRA
)
This loads the 3B model in 4-bit precision, consuming about 2.5GB VRAM. Unsloth provides pre-quantized checkpoints for all popular models.
Step 2: Add LoRA Adapters
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank (8-64, higher = more capacity)
target_modules=[ # Which layers to adapt
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
lora_alpha=16, # Scaling factor (typically = r)
lora_dropout=0, # 0 for Unsloth (handled differently)
bias="none",
use_gradient_checkpointing="unsloth", # 60% less VRAM
random_state=42,
)
# Check trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 41,943,040 || all params: 3,254,779,904 || trainable%: 1.29
Only 1.29% of parameters are trainable. This is why QLoRA is so memory-efficient.
Step 3: Prepare the Dataset
from datasets import load_dataset
# Load from local file
dataset = load_dataset("json", data_files="my_training_data.jsonl", split="train")
# Format into chat template
def format_example(example):
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": example["instruction"] + ("\n" + example["input"] if example.get("input") else "")},
{"role": "assistant", "content": example["output"]}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
return {"text": text}
dataset = dataset.map(format_example)
print(f"Training examples: {len(dataset)}")
print(f"Sample:\n{dataset[0]['text'][:500]}")
Step 4: Configure Training
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=2048,
dataset_num_proc=2,
packing=True, # Pack multiple examples into one sequence
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # Effective batch size: 2 * 4 = 8
warmup_steps=5,
max_steps=300, # Adjust based on dataset size
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
logging_steps=10,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
seed=42,
output_dir="outputs",
save_steps=100,
),
)
Step 5: Train
trainer_stats = trainer.train()
print(f"Training time: {trainer_stats.metrics['train_runtime']:.0f} seconds")
print(f"Samples/second: {trainer_stats.metrics['train_samples_per_second']:.1f}")
print(f"Final loss: {trainer_stats.metrics['train_loss']:.4f}")
Expected output on an RTX 3060 with 300 steps:
Training time: 420 seconds
Samples/second: 3.2
Final loss: 0.8124
Step 6: Test Before Exporting
# Switch to inference mode
FastLanguageModel.for_inference(model)
# Test with a prompt from your domain
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Summarize this customer complaint: I've been waiting 3 weeks for my refund and nobody has responded to my emails."}
]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Evaluating Your Model {#evaluation}
Training loss going down doesn't mean your model is good. You need real evaluation.
Manual Evaluation (Most Important)
Create a test set of 20-50 examples the model has never seen. Run each through the model and score the outputs:
import json
test_examples = []
with open("test_data.jsonl") as f:
for line in f:
test_examples.append(json.loads(line))
results = []
for ex in test_examples:
messages = [
{"role": "user", "content": ex["instruction"] + "\n" + ex.get("input", "")}
]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=512, temperature=0.3)
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
results.append({
"input": ex["instruction"],
"expected": ex["output"],
"generated": generated,
})
# Save for manual review
with open("evaluation_results.json", "w") as f:
json.dump(results, f, indent=2)
print(f"Saved {len(results)} evaluation results for manual review")
Automated Metrics
# pip install rouge-score
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = []
for r in results:
score = scorer.score(r["expected"], r["generated"])
scores.append(score['rougeL'].fmeasure)
avg_rouge = sum(scores) / len(scores)
print(f"Average ROUGE-L: {avg_rouge:.3f}")
# Good: > 0.5 for summarization
# Good: > 0.7 for structured output
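ROUGE compares raw strings, which is a blunt instrument when the target is structured. If your outputs are JSON, parsing both sides and comparing fields is a stricter check. A minimal sketch, assuming flat JSON objects:

```python
import json

def field_accuracy(expected_json: str, generated_json: str) -> float:
    """Fraction of expected fields reproduced exactly in the generated JSON."""
    try:
        expected = json.loads(expected_json)
        generated = json.loads(generated_json)
    except json.JSONDecodeError:
        return 0.0  # unparseable output is a total miss
    if not isinstance(expected, dict) or not isinstance(generated, dict):
        return 0.0
    hits = sum(1 for key, value in expected.items() if generated.get(key) == value)
    return hits / len(expected) if expected else 0.0

score = field_accuracy(
    '{"issue": "delayed delivery", "priority": "High"}',
    '{"issue": "delayed delivery", "priority": "Medium"}',
)
print(score)  # 0.5
```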
A/B Testing Against Base Model
The real question is: "Is my fine-tuned model better than the base model with a good prompt?" Always test this:
# Load base model (without fine-tuning) for comparison
base_model, base_tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
max_seq_length=2048,
load_in_4bit=True,
)
# Run same test set through base model with your best system prompt
# Compare outputs side by side
If the fine-tuned model isn't clearly better than the base model + prompt engineering, don't deploy it. Iterate on your training data instead.
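One way to tabulate the side-by-side comparison: score each model's output against the expected answer and count wins. Similarity via difflib is a stand-in here for human judgment, which is the better criterion when you can afford it:

```python
from difflib import SequenceMatcher

def ab_tally(expected, base_outputs, tuned_outputs):
    """Count per-example wins using similarity to the expected output."""
    wins = {"base": 0, "tuned": 0, "tie": 0}
    for exp, base, tuned in zip(expected, base_outputs, tuned_outputs):
        s_base = SequenceMatcher(None, exp, base).ratio()
        s_tuned = SequenceMatcher(None, exp, tuned).ratio()
        if abs(s_base - s_tuned) < 0.02:   # too close to call
            wins["tie"] += 1
        elif s_tuned > s_base:
            wins["tuned"] += 1
        else:
            wins["base"] += 1
    return wins

print(ab_tally(
    ["Priority: High"],
    ["The priority seems high overall."],  # base model + prompt
    ["Priority: High"],                    # fine-tuned model
))  # {'base': 0, 'tuned': 1, 'tie': 0}
```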
Deploying with Ollama {#deployment}
Step 1: Export to GGUF
# Save merged model (LoRA weights merged into base)
model.save_pretrained_merged(
"finetuned-model",
tokenizer,
save_method="merged_16bit",
)
# Or save as separate LoRA adapter (smaller, portable)
model.save_pretrained("finetuned-lora")
Step 2: Convert to GGUF
# Clone llama.cpp if you haven't
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt
# Build the quantize tool
cmake -B build && cmake --build build --config Release
# Convert to GGUF at f16 (convert_hf_to_gguf.py does not emit q4_K_M directly)
python convert_hf_to_gguf.py ../finetuned-model \
  --outtype f16 \
  --outfile finetuned-model-f16.gguf
# Quantize to 4-bit
./build/bin/llama-quantize finetuned-model-f16.gguf finetuned-model-q4.gguf Q4_K_M
Step 3: Create Ollama Model
# Write the Modelfile
cat > Modelfile << 'EOF'
FROM ./finetuned-model-q4.gguf
TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
PARAMETER stop "<|eot_id|>"
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant specialized in customer support ticket analysis."
EOF
# Create the model
ollama create my-custom-model -f Modelfile
# Test it
ollama run my-custom-model "Summarize: The product broke after 2 days and I want a replacement."
Step 4: Verify Deployment
# Check model is listed
ollama list | grep my-custom-model
# Test via API
curl http://localhost:11434/api/generate -d '{
"model": "my-custom-model",
"prompt": "Summarize this ticket: Customer reports login failures since the update.",
"stream": false
}'
Your fine-tuned model is now available through Ollama's standard API, compatible with any Ollama client including Open WebUI.
Cost Analysis {#cost-analysis}
Local Fine-Tuning Costs
| Component | One-Time Cost | Per-Training-Run |
|---|---|---|
| GPU (RTX 3060 used) | $180-220 | $0 |
| Electricity (1 hour) | -- | $0.15-0.30 |
| Software | $0 | $0 |
| Total (first run) | ~$200 | ~$200 |
| Total (subsequent) | $0 | ~$0.25 |
Cloud Fine-Tuning Comparison
| Service | 7B Model, 500 Steps | Per Run |
|---|---|---|
| OpenAI fine-tuning | $15-50 (depending on dataset) | $15-50 |
| Together AI | $5-20 | $5-20 |
| Google Colab Pro | $10/month (limited hours) | ~$3-5 |
| Local (after GPU purchase) | Electricity only | ~$0.25 |
After 10-15 fine-tuning runs, the local GPU has paid for itself versus cloud alternatives. If you iterate frequently on training data (which you should), local is dramatically cheaper.
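The break-even claim follows directly from the table's numbers:

```python
gpu_cost = 200.0        # used RTX 3060, midpoint of the range above
local_per_run = 0.25    # electricity per run
cloud_per_run = 15.0    # low end of the OpenAI fine-tuning estimate

# Each local run saves (cloud - local); divide that into the GPU cost
breakeven_runs = gpu_cost / (cloud_per_run - local_per_run)
print(f"{breakeven_runs:.1f} runs")  # 13.6 runs
```

At the cheaper Together AI rate (~$5/run) the break-even stretches to roughly 42 runs, so the payoff depends on how often you iterate.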
Time Investment
| Phase | First Time | Subsequent Runs |
|---|---|---|
| Environment setup | 30-60 min | 0 min (already done) |
| Data preparation | 2-8 hours | 30 min (iterating) |
| Training (300 steps, 7B) | 30-60 min | 30-60 min |
| Evaluation | 1-2 hours | 30 min |
| Deployment | 30 min | 5 min |
Common Pitfalls {#pitfalls}
1. Training on AI-Generated Data Without Review
If you use GPT-4 or Claude to generate your training data, you're teaching your local model to mimic another AI -- including its errors, hallucinations, and biases. Always have a human review and correct AI-generated training examples.
2. Too Many Training Steps (Overfitting)
Symptom: Model gives perfect answers for training-like inputs but gibberish or repetitive outputs for novel inputs.
Fix: Reduce max_steps. A good starting point is (dataset_size / batch_size) * 3, i.e. about three epochs. Monitor training loss -- if it drops below 0.1, you're almost certainly overfitting.
# Good heuristic for max_steps
num_examples = 500
batch_size = 8
epochs = 3
max_steps = (num_examples // batch_size) * epochs # 186 steps
3. Too Little Data
50 examples is the absolute minimum for learning basic format changes. For domain adaptation, you need 200+. For complex multi-task behavior, 1,000+. If you have fewer than 100 examples, spend more time on data collection before training.
4. Inconsistent Training Data
If half your examples use formal tone and half use casual tone, the model will randomly switch between them. Audit your dataset for consistency in:
- Output format (JSON, prose, bullet points)
- Tone (formal, casual, technical)
- Length (all short, all long, or intentionally varied)
- Level of detail
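Some of these checks can be automated. A minimal audit sketch that flags mixed output formats and reports length spread; the format heuristics and the 90% dominance threshold are assumptions to adjust for your task:

```python
import json
import statistics

def audit_outputs(outputs):
    """Tally output formats and measure length spread across a dataset."""
    formats = {"json": 0, "bullets": 0, "prose": 0}
    for out in outputs:
        s = out.strip()
        if s.startswith(("{", "[")):
            formats["json"] += 1
        elif s.startswith(("-", "*", "•")):
            formats["bullets"] += 1
        else:
            formats["prose"] += 1
    lengths = [len(o) for o in outputs]
    dominant = max(formats.values()) / max(len(outputs), 1)
    spread = statistics.pstdev(lengths) / max(statistics.mean(lengths), 1)
    return {"formats": formats, "dominant_share": dominant, "length_spread": spread}

# For a real dataset:
# outputs = [json.loads(line)["output"] for line in open("my_training_data.jsonl")]
report = audit_outputs([
    '{"issue": "delay", "priority": "High"}',
    '{"issue": "crash", "priority": "Medium"}',
    "The customer is unhappy about a delayed order.",  # odd one out
])
print(report["formats"])  # {'json': 2, 'bullets': 0, 'prose': 1}
if report["dominant_share"] < 0.9:
    print("Warning: mixed output formats -- the model may switch between them")
```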
5. Wrong Base Model
Starting from the wrong base model wastes capacity. Guidelines:
- English general tasks: Llama 3.2 3B/8B
- Multilingual: Qwen 2.5 7B/14B
- Code: Qwen2.5-Coder or CodeLlama
- Reasoning-heavy: DeepSeek R1 distill variants
- Constrained hardware: Gemma 3 1B or Phi-4 Mini
6. Skipping Evaluation
"The loss went down so it must be working" is a trap. Loss measures how well the model predicts the next token on training data. It does not measure whether the model is useful for your actual task. Always run evaluation on a held-out test set.
7. LoRA Rank Too Low
With r=4 (very low rank), the adapter doesn't have enough capacity to learn complex behaviors. Start with r=16 for most tasks. Increase to 32 or 64 if you have enough VRAM and the model isn't learning the task. r=8 is fine for simple format changes.
Conclusion
Fine-tuning a model on your own data is no longer a research exercise. With QLoRA and Unsloth, the entire process -- from raw data to a deployed Ollama model -- takes 2-4 hours on a single consumer GPU.
The hard part isn't the code. It's the data. Spend 80% of your effort on data quality and 20% on training configuration. Start with 200 carefully reviewed examples, train for 300 steps, evaluate honestly, and iterate. Three rounds of data improvement plus retraining will outperform one round with 10x more data.
If you're just starting with local AI and haven't picked your hardware yet, our RAM requirements guide will help you size a machine that handles both inference and training. For the foundational LoRA concepts, see our LoRA fine-tuning guide.
The complete Unsloth documentation and model zoo are available at the Unsloth GitHub repository. Pre-quantized base models for common architectures are listed at HuggingFace's training documentation.
Ready to fine-tune but not sure which base model to start from? Our best local AI models for 8GB RAM guide ranks every major family by capability and memory usage.