Training Guide

Fine-Tune AI Models with Your Own Data Locally

April 10, 2026
26 min read
Local AI Master Research Team


The Short Version

Fine-tune a model on your data in four steps:

  1. Prepare data: Format as JSONL with instruction/input/output fields
  2. Install Unsloth: pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
  3. Train: Load base model with QLoRA, train for 100-500 steps
  4. Deploy: Export to GGUF, create Ollama model, serve locally

Total time: 30-90 minutes on a single consumer GPU. Cost: electricity only.


What this guide covers:

  • When fine-tuning beats prompting and RAG (and when it doesn't)
  • Data preparation: format, quality, minimum dataset size
  • LoRA vs QLoRA vs full fine-tuning explained
  • Complete Unsloth training pipeline with real code
  • Evaluation methods that actually tell you if it worked
  • Deploying your custom model through Ollama
  • Common pitfalls that waste your time

Fine-tuning is the process of taking a pre-trained model and teaching it new behaviors using your own data. You're not training from scratch -- you're adjusting an existing model's knowledge to match your specific needs. A customer support bot that knows your product catalog. A code assistant that follows your team's conventions. A medical summarizer that uses your institution's terminology.

The barrier to entry dropped dramatically over the past year. QLoRA (Quantized Low-Rank Adaptation) lets you fine-tune a 7B parameter model on a GPU with just 4-6GB of VRAM. Unsloth makes the training process 2x faster than standard HuggingFace implementations. You no longer need cloud compute or expensive hardware.

For the LoRA fundamentals this guide builds on, see our LoRA fine-tuning local guide. If you're not sure whether fine-tuning is right for your use case, start with the decision framework in the next section.

Table of Contents

  1. Fine-Tuning vs Prompting vs RAG
  2. Data Preparation
  3. LoRA vs QLoRA vs Full Fine-Tuning
  4. Hardware Requirements
  5. Unsloth Setup
  6. Step-by-Step Training
  7. Evaluating Your Model
  8. Deploying with Ollama
  9. Cost Analysis
  10. Common Pitfalls

Fine-Tuning vs Prompting vs RAG {#when-to-fine-tune}

Three approaches exist to customize AI behavior. Choosing wrong wastes weeks.

Decision Framework

| Approach | Best When | Dataset Size | Effort | Result |
|---|---|---|---|---|
| Prompting | You need specific output format | 0 examples | 10 minutes | Good enough for many tasks |
| RAG | You need current, factual answers from documents | Any doc count | 1-2 hours setup | Great for knowledge bases |
| Fine-tuning | You need changed model behavior/style | 100-10,000 examples | 2-8 hours | Permanent behavior change |

When Fine-Tuning Is the Right Choice

Fine-tune when you need:

  • Consistent output format that prompting can't enforce (e.g., always respond in specific JSON schema)
  • Domain-specific language the base model doesn't understand (legal, medical, proprietary jargon)
  • A specific tone or personality that system prompts can't reliably maintain
  • Reduced inference cost (fine-tuned models need shorter prompts)
  • Tasks where the model consistently fails despite good prompting

When Fine-Tuning Is the Wrong Choice

Don't fine-tune when:

  • Your data changes frequently (use RAG instead -- it handles updates without retraining)
  • You need factual accuracy from a knowledge base (RAG retrieves facts; fine-tuning memorizes patterns)
  • A well-crafted system prompt already gets 90%+ accuracy (engineering the prompt is faster)
  • You have fewer than 50 high-quality examples (the model won't learn meaningful patterns)
  • You want to add "knowledge" to the model (that's what RAG is for)

The most common mistake: people fine-tune to add knowledge when they should use RAG, or they build RAG pipelines when they need style/format changes that fine-tuning handles trivially.


Data Preparation {#data-preparation}

Data quality is everything. A small dataset of 200 excellent examples will outperform 5,000 sloppy ones. Period.

Data Format

The standard format is JSONL (JSON Lines) with instruction-input-output triples:

{"instruction": "Summarize this customer complaint", "input": "I ordered product X three weeks ago and it still hasn't arrived. I've called twice and nobody can tell me where it is. This is unacceptable.", "output": "Issue: Delayed delivery (3+ weeks). Product: X. Previous contact: 2 calls, unresolved. Sentiment: Frustrated. Priority: High."}
{"instruction": "Summarize this customer complaint", "input": "The software crashes every time I try to export to PDF. I'm running version 4.2 on Windows 11.", "output": "Issue: PDF export crash (reproducible). Software: v4.2. OS: Windows 11. Sentiment: Neutral. Priority: Medium."}

Alternative Format: Chat/Conversation

{"conversations": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Patient presents with acute lower back pain radiating to left leg, worsened by sitting."}, {"role": "assistant", "content": "Suggested ICD-10 codes:\n- M54.5: Low back pain\n- M54.41: Sciatica, left side\n\nPrimary: M54.41 (radiculopathy pattern suggests sciatica as primary)"}]}

Data Quality Checklist

  1. Consistent format -- Every example follows the same structure
  2. Correct outputs -- Human-verified, not AI-generated without review
  3. Representative -- Cover the full range of inputs your model will encounter
  4. Diverse -- Avoid too many similar examples (model will overfit)
  5. Clean -- No typos in outputs, no contradictory examples, no truncated text
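Exact duplicates are easy to catch; near-duplicates (item 4) need a fuzzier check. A sketch using the standard library's difflib — the 0.9 threshold is a judgment call, and the O(n²) scan is only practical up to a few thousand examples:

```python
import difflib
import itertools

def near_duplicates(outputs, threshold=0.9):
    """Return index pairs whose outputs are suspiciously similar."""
    pairs = []
    for (i, a), (j, b) in itertools.combinations(enumerate(outputs), 2):
        if difflib.SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((i, j))
    return pairs

outputs = [
    "Issue: Delayed delivery. Priority: High.",
    "Issue: Delayed delivery. Priority: High!",   # near-duplicate of the first
    "Issue: PDF export crash. Priority: Medium.",
]
print(near_duplicates(outputs))  # -> [(0, 1)]
```

If a pair scores above the threshold, keep the better-written example and drop the other.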

Minimum Dataset Size

| Task Complexity | Minimum Examples | Recommended | Steps |
|---|---|---|---|
| Simple format change | 50 | 200 | 100 |
| Domain adaptation | 200 | 1,000 | 300 |
| Complex behavior | 500 | 2,000-5,000 | 500-1,000 |
| Multi-task | 1,000+ | 5,000-10,000 | 1,000+ |

Data Preparation Script

import json

def validate_dataset(filepath):
    """Validate a JSONL dataset for fine-tuning."""
    issues = []
    examples = []

    with open(filepath, 'r') as f:
        for i, line in enumerate(f):
            try:
                data = json.loads(line)
                examples.append(data)

                # Check required fields
                if 'instruction' not in data:
                    issues.append(f"Line {i+1}: Missing 'instruction' field")
                if 'output' not in data:
                    issues.append(f"Line {i+1}: Missing 'output' field")

                # Check for empty outputs
                if data.get('output', '').strip() == '':
                    issues.append(f"Line {i+1}: Empty output")

                # Check output length
                if len(data.get('output', '')) < 10:
                    issues.append(f"Line {i+1}: Very short output ({len(data.get('output', ''))} chars)")

            except json.JSONDecodeError:
                issues.append(f"Line {i+1}: Invalid JSON")

    print(f"Total examples: {len(examples)}")
    print(f"Issues found: {len(issues)}")
    for issue in issues[:20]:
        print(f"  - {issue}")

    # Check for duplicates
    outputs = [e.get('output', '') for e in examples]
    dupes = len(outputs) - len(set(outputs))
    if dupes:
        print(f"  - {dupes} duplicate outputs detected")

    return len(issues) == 0

validate_dataset("my_training_data.jsonl")

LoRA vs QLoRA vs Full Fine-Tuning {#training-methods}

Full Fine-Tuning

Updates every weight in the model. Produces the best results but requires enormous resources.

| Model Size | VRAM Needed | Training Time | When to Use |
|---|---|---|---|
| 1B | ~12GB | ~1 hour | Almost never locally |
| 7B | ~60GB | ~6 hours | Multi-GPU setups only |
| 13B | ~120GB | ~12 hours | Server/cloud only |

Verdict: Not practical on consumer hardware for models above 1B.

LoRA (Low-Rank Adaptation)

Freezes the base model and trains small adapter matrices that modify the model's behavior. Dramatically reduces memory and training time.

| Model Size | VRAM Needed | Adapter Size | Quality vs Full |
|---|---|---|---|
| 7B | ~14GB | 20-100MB | 95-98% |
| 13B | ~28GB | 30-150MB | 95-98% |

How LoRA works: Instead of updating a weight matrix W directly (e.g., 4096x4096 = ~16.8M parameters), LoRA decomposes the update into two small matrices A and B where W' = W + AB. With rank r=16, A is 4096x16 and B is 16x4096, so the update has only 4096*16 + 16*4096 = 131,072 parameters. You train ~131K values instead of ~16.8M.
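The arithmetic is easy to check in a few lines of plain Python:

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank update AB: A is (d_in x rank), B is (rank x d_out)."""
    return d_in * rank + rank * d_out

full = 4096 * 4096                                  # updating W directly
lora = lora_trainable_params(4096, 4096, rank=16)   # the LoRA update
print(full, lora, full // lora)  # 16777216 131072 128
```

A 128x reduction in trainable parameters per adapted matrix, which is where the memory savings come from.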

QLoRA (Quantized LoRA)

Same concept as LoRA, but the base model is loaded in 4-bit quantization. This is the breakthrough that makes consumer-GPU fine-tuning possible.

| Model Size | VRAM Needed | Quality vs Full | Speed |
|---|---|---|---|
| 3B | ~4GB | 93-96% | Fast |
| 7B | ~6GB | 93-96% | Moderate |
| 13B | ~12GB | 93-96% | Slow |
| 32B | ~24GB | 93-96% | Very slow |

Verdict: QLoRA is the standard for local fine-tuning. The 3-5% quality gap versus full fine-tuning is rarely noticeable in practice.


Hardware Requirements {#hardware-requirements}

Minimum Specs for QLoRA Fine-Tuning

| Model Size | Min VRAM | Min RAM | Storage | Time (500 steps) |
|---|---|---|---|---|
| 3B | 4GB | 8GB | 10GB | 15-25 min |
| 7-8B | 6GB | 16GB | 20GB | 30-60 min |
| 13-14B | 10GB | 16GB | 40GB | 1-2 hours |
| 32B | 22GB | 32GB | 80GB | 3-6 hours |
| 70B | 44GB | 64GB | 160GB | 8-16 hours |

Tested GPU Performance

| GPU | VRAM | Max Model (QLoRA) | 7B Training Speed |
|---|---|---|---|
| RTX 3060 | 12GB | 13B | 3.2 samples/sec |
| RTX 4060 Ti | 8GB | 7B | 4.1 samples/sec |
| RTX 4070 | 12GB | 13B | 5.8 samples/sec |
| RTX 4090 | 24GB | 32B | 9.4 samples/sec |
| RTX 5090 | 32GB | 70B (tight) | 11.2 samples/sec |

Apple Silicon

Apple Silicon can fine-tune via MLX, but it's significantly slower than CUDA:

| Mac | RAM | Max Model | 7B Speed |
|---|---|---|---|
| M1 16GB | 16GB | 7B | 0.8 samples/sec |
| M3 Pro 18GB | 18GB | 7B | 1.4 samples/sec |
| M3 Max 36GB | 36GB | 14B | 1.8 samples/sec |
| M4 Pro 24GB | 24GB | 13B | 1.6 samples/sec |

An RTX 3060 trains more than twice as fast as an M3 Pro for the same model (3.2 vs 1.4 samples/sec in the tables above). NVIDIA is the clear choice for fine-tuning specifically. For inference-only workloads, Apple Silicon is more competitive. See our RAM requirements guide for inference-focused hardware sizing.


Unsloth Setup {#unsloth-setup}

Unsloth is the fastest open-source fine-tuning framework. It provides 2x training speed over standard HuggingFace, automatic memory optimization, and pre-quantized model loading.

Installation

# Create a clean environment
conda create -n finetune python=3.11
conda activate finetune

# Install PyTorch (CUDA)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install Unsloth
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

# Verify CUDA
python -c "import torch; print(torch.cuda.is_available())"
# Should print: True

Verify Setup

from unsloth import FastLanguageModel
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

Step-by-Step Training {#training-pipeline}

Here's a complete, working fine-tuning pipeline. This example fine-tunes Llama 3.2 3B on a custom instruction dataset.

Step 1: Load the Base Model

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,            # Auto-detect
    load_in_4bit=True,     # QLoRA
)

This loads the 3B model in 4-bit precision, consuming about 2.5GB VRAM. Unsloth provides pre-quantized checkpoints for all popular models.

Step 2: Add LoRA Adapters

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # LoRA rank (8-64, higher = more capacity)
    target_modules=[       # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_alpha=16,         # Scaling factor (typically = r)
    lora_dropout=0,        # 0 for Unsloth (handled differently)
    bias="none",
    use_gradient_checkpointing="unsloth",  # 60% less VRAM
    random_state=42,
)

# Check trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 41,943,040 || all params: 3,254,779,904 || trainable%: 1.29

Only 1.29% of parameters are trainable. This is why QLoRA is so memory-efficient.

Step 3: Prepare the Dataset

from datasets import load_dataset

# Load from local file
dataset = load_dataset("json", data_files="my_training_data.jsonl", split="train")

# Format into chat template
def format_example(example):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": example["instruction"] + ("\n" + example["input"] if example.get("input") else "")},
        {"role": "assistant", "content": example["output"]}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    return {"text": text}

dataset = dataset.map(format_example)
print(f"Training examples: {len(dataset)}")
print(f"Sample:\n{dataset[0]['text'][:500]}")
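Before training, it's worth carving off a held-out test set now; the evaluation section depends on one. A standard-library sketch (the `split_examples` name and 10% split are choices, not requirements):

```python
import random

def split_examples(examples, test_fraction=0.1, seed=42):
    """Shuffle and carve off a held-out test set (deterministic via seed)."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train_examples, test_examples = split_examples(list(range(100)))
print(len(train_examples), len(test_examples))  # 90 10
```

If your data is already a HuggingFace `Dataset` (as in the step above), `dataset.train_test_split(test_size=0.1, seed=42)` does the same thing. Either way, save the held-out portion as `test_data.jsonl` so evaluation never sees training data.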

Step 4: Configure Training

from trl import SFTTrainer
from transformers import TrainingArguments
import torch  # needed for the fp16/bf16 check below

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=True,          # Pack multiple examples into one sequence
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size: 2 * 4 = 8
        warmup_steps=5,
        max_steps=300,                  # Adjust based on dataset size
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
        save_steps=100,
    ),
)

Step 5: Train

trainer_stats = trainer.train()

print(f"Training time: {trainer_stats.metrics['train_runtime']:.0f} seconds")
print(f"Samples/second: {trainer_stats.metrics['train_samples_per_second']:.1f}")
print(f"Final loss: {trainer_stats.metrics['train_loss']:.4f}")

Expected output on an RTX 3060 with 300 steps:

Training time: 420 seconds
Samples/second: 3.2
Final loss: 0.8124

Step 6: Test Before Exporting

# Switch to inference mode
FastLanguageModel.for_inference(model)

# Test with a prompt from your domain
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this customer complaint: I've been waiting 3 weeks for my refund and nobody has responded to my emails."}
]

inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Evaluating Your Model {#evaluation}

Training loss going down doesn't mean your model is good. You need real evaluation.

Manual Evaluation (Most Important)

Create a test set of 20-50 examples the model has never seen. Run each through the model and score the outputs:

import json

test_examples = []
with open("test_data.jsonl") as f:
    for line in f:
        test_examples.append(json.loads(line))

results = []
for ex in test_examples:
    messages = [
        {"role": "user", "content": ex["instruction"] + "\n" + ex.get("input", "")}
    ]
    inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
    outputs = model.generate(input_ids=inputs, max_new_tokens=512, do_sample=True, temperature=0.3)
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

    results.append({
        "input": ex["instruction"],
        "expected": ex["output"],
        "generated": generated,
    })

# Save for manual review
with open("evaluation_results.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"Saved {len(results)} evaluation results for manual review")

Automated Metrics

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

scores = []
for r in results:
    score = scorer.score(r["expected"], r["generated"])
    scores.append(score['rougeL'].fmeasure)

avg_rouge = sum(scores) / len(scores)
print(f"Average ROUGE-L: {avg_rouge:.3f}")
# Good: > 0.5 for summarization
# Good: > 0.7 for structured output
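ROUGE suits prose summaries. For JSON-schema tasks, a stricter and simpler metric is the share of outputs that even parse. A sketch over the generated outputs (the `json_validity_rate` helper and toy samples are illustrative):

```python
import json

def json_validity_rate(generated_outputs):
    """Share of model outputs that parse as valid JSON."""
    ok = 0
    for text in generated_outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(generated_outputs)

samples = ['{"issue": "refund delay"}', "Sorry, here is the summary...", '{"issue": "crash"}']
print(f"{json_validity_rate(samples):.2f}")  # 0.67
```

A fine-tuned model on a JSON task should score at or very near 1.0; anything lower means the format didn't stick.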

A/B Testing Against Base Model

The real question is: "Is my fine-tuned model better than the base model with a good prompt?" Always test this:

# Load base model (without fine-tuning) for comparison
base_model, base_tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Run same test set through base model with your best system prompt
# Compare outputs side by side

If the fine-tuned model isn't clearly better than the base model + prompt engineering, don't deploy it. Iterate on your training data instead.
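To keep the comparison honest, randomize which model appears first so you score blind. A sketch (the `make_blind_pairs` helper and output file name are illustrative, not a standard tool):

```python
import json
import random

def make_blind_pairs(prompts, finetuned_outputs, base_outputs, seed=42):
    """Randomize which model is 'A' so the reviewer scores blind."""
    rng = random.Random(seed)
    pairs = []
    for prompt, ft, base in zip(prompts, finetuned_outputs, base_outputs):
        if rng.random() < 0.5:
            pairs.append({"input": prompt, "A": ft, "B": base, "key": "A=finetuned"})
        else:
            pairs.append({"input": prompt, "A": base, "B": ft, "key": "A=base"})
    return pairs

pairs = make_blind_pairs(["prompt"], ["finetuned answer"], ["base answer"])
with open("ab_review.json", "w") as f:
    json.dump(pairs, f, indent=2)
```

Score each pair A or B without looking at the key, then unblind and tally the win rate.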


Deploying with Ollama {#deployment}

Step 1: Export to GGUF

# Save merged model (LoRA weights merged into base)
model.save_pretrained_merged(
    "finetuned-model",
    tokenizer,
    save_method="merged_16bit",
)

# Or save as separate LoRA adapter (smaller, portable)
model.save_pretrained("finetuned-lora")

Step 2: Convert to GGUF

# Clone and build llama.cpp if you haven't
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt
cmake -B build && cmake --build build --config Release

# Convert to 16-bit GGUF (the convert script doesn't emit q4_K_M directly)
python convert_hf_to_gguf.py ../finetuned-model \
  --outtype f16 \
  --outfile finetuned-model-f16.gguf

# Quantize down to 4-bit
./build/bin/llama-quantize finetuned-model-f16.gguf finetuned-model-q4.gguf Q4_K_M

Step 3: Create Ollama Model

# Write the Modelfile
cat > Modelfile << 'EOF'
FROM ./finetuned-model-q4.gguf
TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""
PARAMETER stop "<|eot_id|>"
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant specialized in customer support ticket analysis."
EOF

# Create the model
ollama create my-custom-model -f Modelfile

# Test it
ollama run my-custom-model "Summarize: The product broke after 2 days and I want a replacement."

Step 4: Verify Deployment

# Check model is listed
ollama list | grep my-custom-model

# Test via API
curl http://localhost:11434/api/generate -d '{
  "model": "my-custom-model",
  "prompt": "Summarize this ticket: Customer reports login failures since the update.",
  "stream": false
}'

Your fine-tuned model is now available through Ollama's standard API, compatible with any Ollama client including Open WebUI.
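The same endpoint works from Python with only the standard library. A minimal sketch (the `ask` helper is an assumption of this guide, not part of Ollama; it targets the default port 11434 with a non-streaming call):

```python
import json
import urllib.request

def build_payload(model, prompt):
    """Request body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False})

def ask(model, prompt, host="http://localhost:11434"):
    """Blocking, non-streaming generate call using only the standard library."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_payload(model, prompt).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With Ollama running:
# print(ask("my-custom-model", "Summarize: login failures since the update."))
```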


Cost Analysis {#cost-analysis}

Local Fine-Tuning Costs

| Component | One-Time Cost | Per-Training-Run |
|---|---|---|
| GPU (RTX 3060, used) | $180-220 | $0 |
| Electricity (1 hour) | -- | $0.15-0.30 |
| Software | $0 | $0 |
| Total (first run) | ~$200 | ~$200 |
| Total (subsequent) | $0 | ~$0.25 |

Cloud Fine-Tuning Comparison

| Service | 7B Model, 500 Steps | Per Run |
|---|---|---|
| OpenAI fine-tuning | $15-50 (depending on dataset) | $15-50 |
| Together AI | $5-20 | $5-20 |
| Google Colab Pro | $10/month (limited hours) | ~$3-5 |
| Local (after GPU purchase) | Electricity only | ~$0.25 |

After 10-15 fine-tuning runs, the local GPU has paid for itself versus cloud alternatives. If you iterate frequently on training data (which you should), local is dramatically cheaper.
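The break-even math behind that claim, using the tables' numbers ($15 is the low end of the OpenAI row, $200 the midpoint of the used-GPU range):

```python
gpu_cost = 200.0   # used RTX 3060, midpoint of $180-220
local_run = 0.25   # electricity per training run
cloud_run = 15.0   # low end of cloud fine-tuning per run

# Each local run saves (cloud_run - local_run); the GPU pays for itself after:
runs_to_break_even = gpu_cost / (cloud_run - local_run)
print(round(runs_to_break_even, 1))  # 13.6
```

At the higher cloud prices ($50/run), break-even arrives after just four runs.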

Time Investment

| Phase | First Time | Subsequent Runs |
|---|---|---|
| Environment setup | 30-60 min | 0 min (already done) |
| Data preparation | 2-8 hours | 30 min (iterating) |
| Training (300 steps, 7B) | 30-60 min | 30-60 min |
| Evaluation | 1-2 hours | 30 min |
| Deployment | 30 min | 5 min |

Common Pitfalls {#pitfalls}

1. Training on AI-Generated Data Without Review

If you use GPT-4 or Claude to generate your training data, you're teaching your local model to mimic another AI -- including its errors, hallucinations, and biases. Always have a human review and correct AI-generated training examples.

2. Too Many Training Steps (Overfitting)

Symptom: Model gives perfect answers for training-like inputs but gibberish or repetitive outputs for novel inputs.

Fix: Reduce max_steps. A reasonable starting point is (dataset_size / batch_size) * 3, i.e., about three epochs. Monitor training loss -- if it drops below 0.1, you're almost certainly overfitting.

# Good heuristic for max_steps
num_examples = 500
batch_size = 8
epochs = 3
max_steps = (num_examples // batch_size) * epochs  # 62 * 3 = 186 steps

3. Too Little Data

50 examples is the absolute minimum for learning basic format changes. For domain adaptation, you need 200+. For complex multi-task behavior, 1,000+. If you have fewer than 100 examples, spend more time on data collection before training.

4. Inconsistent Training Data

If half your examples use formal tone and half use casual tone, the model will randomly switch between them. Audit your dataset for consistency in:

  • Output format (JSON, prose, bullet points)
  • Tone (formal, casual, technical)
  • Length (all short, all long, or intentionally varied)
  • Level of detail
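The first three bullets can be spot-checked with a script. A rough sketch (the format heuristics and the `audit_consistency` name are illustrative, not a standard tool):

```python
import statistics

def audit_consistency(outputs):
    """Rough consistency report: output-format mix and length spread."""
    formats = {"json": 0, "bullets": 0, "prose": 0}
    for out in outputs:
        s = out.strip()
        if s.startswith(("{", "[")):
            formats["json"] += 1
        elif s.startswith(("-", "•", "*")):
            formats["bullets"] += 1
        else:
            formats["prose"] += 1
    lengths = [len(o) for o in outputs]
    return {
        "formats": formats,
        "mean_len": statistics.mean(lengths),
        "stdev_len": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
    }

report = audit_consistency(['{"a": 1}', "- point one", "Plain prose answer."])
print(report["formats"])  # {'json': 1, 'bullets': 1, 'prose': 1}
```

A healthy dataset shows one dominant format; an even three-way split like this toy example is exactly the inconsistency that confuses the model.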

5. Wrong Base Model

Starting from the wrong base model wastes capacity. Guidelines:

  • English general tasks: Llama 3.2 3B/8B
  • Multilingual: Qwen 2.5 7B/14B
  • Code: Qwen2.5-Coder or CodeLlama
  • Reasoning-heavy: DeepSeek R1 distill variants
  • Constrained hardware: Gemma 3 1B or Phi-4 Mini

6. Skipping Evaluation

"The loss went down so it must be working" is a trap. Loss measures how well the model predicts the next token on training data. It does not measure whether the model is useful for your actual task. Always run evaluation on a held-out test set.

7. LoRA Rank Too Low

With r=4 (very low rank), the adapter doesn't have enough capacity to learn complex behaviors. Start with r=16 for most tasks. Increase to 32 or 64 if you have enough VRAM and the model isn't learning the task. r=8 is fine for simple format changes.


Conclusion

Fine-tuning a model on your own data is no longer a research exercise. With QLoRA and Unsloth, the entire process -- from raw data to a deployed Ollama model -- takes 2-4 hours on a single consumer GPU.

The hard part isn't the code. It's the data. Spend 80% of your effort on data quality and 20% on training configuration. Start with 200 carefully reviewed examples, train for 300 steps, evaluate honestly, and iterate. Three rounds of data improvement plus retraining will outperform one round with 10x more data.

If you're just starting with local AI and haven't picked your hardware yet, our RAM requirements guide will help you size a machine that handles both inference and training. For the foundational LoRA concepts, see our LoRA fine-tuning guide.

The complete Unsloth documentation and model zoo are available at the Unsloth GitHub repository. Pre-quantized base models for common architectures are published under Unsloth's HuggingFace organization.


Ready to fine-tune but not sure which base model to start from? Our best local AI models for 8GB RAM guide ranks every major family by capability and memory usage.

Written by Pattanaik Ramswarup

AI Engineer & Dataset Architect | Creator of the 77,000 Training Dataset
