Fine-Tune AI Models with Your Own Data Locally
Published on April 10, 2026 • 26 min read
The Short Version
Fine-tune a model on your data in four steps:
- Prepare data: Format as JSONL with instruction/input/output fields
- Install Unsloth: pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
- Train: Load base model with QLoRA, train for 100-500 steps
- Deploy: Export to GGUF, create Ollama model, serve locally
Total time: 30-90 minutes on a single consumer GPU. Cost: electricity only.
What this guide covers:
- When fine-tuning beats prompting and RAG (and when it doesn't)
- Data preparation: format, quality, minimum dataset size
- LoRA vs QLoRA vs full fine-tuning explained
- Complete Unsloth training pipeline with real code
- Evaluation methods that actually tell you if it worked
- Deploying your custom model through Ollama
- Common pitfalls that waste your time
Fine-tuning is the process of taking a pre-trained model and teaching it new behaviors using your own data. You're not training from scratch -- you're adjusting an existing model's knowledge to match your specific needs. A customer support bot that knows your product catalog. A code assistant that follows your team's conventions. A medical summarizer that uses your institution's terminology.
The barrier to entry dropped dramatically over the past year. QLoRA (Quantized Low-Rank Adaptation) lets you fine-tune a 7B parameter model on a GPU with just 4-6GB of VRAM. Unsloth makes the training process 2x faster than standard HuggingFace implementations. You no longer need cloud compute or expensive hardware.
For the LoRA fundamentals this guide builds on, see our LoRA fine-tuning local guide. If you're not sure whether fine-tuning is right for your use case, start with the decision framework in the next section.
Table of Contents
- Fine-Tuning vs Prompting vs RAG
- Data Preparation
- LoRA vs QLoRA vs Full Fine-Tuning
- Hardware Requirements
- Unsloth Setup
- Step-by-Step Training
- Evaluating Your Model
- Deploying with Ollama
- Cost Analysis
- Common Pitfalls
Fine-Tuning vs Prompting vs RAG {#when-to-fine-tune}
Three approaches exist to customize AI behavior. Choosing wrong wastes weeks.
Decision Framework
| Approach | Best When | Dataset Size | Effort | Result |
|---|---|---|---|---|
| Prompting | You need specific output format | 0 examples | 10 minutes | Good enough for many tasks |
| RAG | You need current, factual answers from documents | Any doc count | 1-2 hours setup | Great for knowledge bases |
| Fine-tuning | You need changed model behavior/style | 100-10,000 examples | 2-8 hours | Permanent behavior change |
When Fine-Tuning Is the Right Choice
Fine-tune when you need:
- Consistent output format that prompting can't enforce (e.g., always respond in specific JSON schema)
- Domain-specific language the base model doesn't understand (legal, medical, proprietary jargon)
- A specific tone or personality that system prompts can't reliably maintain
- Reduced inference cost (fine-tuned models need shorter prompts)
- Tasks where the model consistently fails despite good prompting
When Fine-Tuning Is the Wrong Choice
Don't fine-tune when:
- Your data changes frequently (use RAG instead -- it handles updates without retraining)
- You need factual accuracy from a knowledge base (RAG retrieves facts; fine-tuning memorizes patterns)
- A well-crafted system prompt already gets 90%+ accuracy (engineering the prompt is faster)
- You have fewer than 50 high-quality examples (the model won't learn meaningful patterns)
- You want to add "knowledge" to the model (that's what RAG is for)
The most common mistake: people fine-tune to add knowledge when they should use RAG, or they build RAG pipelines when they need style/format changes that fine-tuning handles trivially.
Data Preparation {#data-preparation}
Data quality is everything. A small dataset of 200 excellent examples will outperform 5,000 sloppy ones. Period.
Data Format
The standard format is JSONL (JSON Lines) with instruction-input-output triples:
{"instruction": "Summarize this customer complaint", "input": "I ordered product X three weeks ago and it still hasn't arrived. I've called twice and nobody can tell me where it is. This is unacceptable.", "output": "Issue: Delayed delivery (3+ weeks). Product: X. Previous contact: 2 calls, unresolved. Sentiment: Frustrated. Priority: High."}
{"instruction": "Summarize this customer complaint", "input": "The software crashes every time I try to export to PDF. I'm running version 4.2 on Windows 11.", "output": "Issue: PDF export crash (reproducible). Software: v4.2. OS: Windows 11. Sentiment: Neutral. Priority: Medium."}
Alternative Format: Chat/Conversation
{"conversations": [{"role": "system", "content": "You are a medical coding assistant."}, {"role": "user", "content": "Patient presents with acute lower back pain radiating to left leg, worsened by sitting."}, {"role": "assistant", "content": "Suggested ICD-10 codes:\n- M54.5: Low back pain\n- M54.41: Sciatica, left side\n\nPrimary: M54.41 (radiculopathy pattern suggests sciatica as primary)"}]}
Data Quality Checklist
- Consistent format -- Every example follows the same structure
- Correct outputs -- Human-verified, not AI-generated without review
- Representative -- Cover the full range of inputs your model will encounter
- Diverse -- Avoid too many similar examples (model will overfit)
- Clean -- No typos in outputs, no contradictory examples, no truncated text
Minimum Dataset Size
| Task Complexity | Minimum Examples | Recommended | Steps |
|---|---|---|---|
| Simple format change | 50 | 200 | 100 |
| Domain adaptation | 200 | 1,000 | 300 |
| Complex behavior | 500 | 2,000-5,000 | 500-1,000 |
| Multi-task | 1,000+ | 5,000-10,000 | 1,000+ |
Data Preparation Script
import json
def validate_dataset(filepath):
"""Validate a JSONL dataset for fine-tuning."""
issues = []
examples = []
with open(filepath, 'r') as f:
for i, line in enumerate(f):
try:
data = json.loads(line)
examples.append(data)
# Check required fields
if 'instruction' not in data:
issues.append(f"Line {i+1}: Missing 'instruction' field")
if 'output' not in data:
issues.append(f"Line {i+1}: Missing 'output' field")
# Check for empty outputs
if data.get('output', '').strip() == '':
issues.append(f"Line {i+1}: Empty output")
# Check output length
if len(data.get('output', '')) < 10:
issues.append(f"Line {i+1}: Very short output ({len(data.get('output', ''))} chars)")
except json.JSONDecodeError:
issues.append(f"Line {i+1}: Invalid JSON")
print(f"Total examples: {len(examples)}")
print(f"Issues found: {len(issues)}")
for issue in issues[:20]:
print(f" - {issue}")
# Check for duplicates
outputs = [e.get('output', '') for e in examples]
dupes = len(outputs) - len(set(outputs))
if dupes:
print(f" - {dupes} duplicate outputs detected")
return len(issues) == 0
validate_dataset("my_training_data.jsonl")
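The duplicate check above only catches byte-identical outputs. Near-duplicates (the same output with trivial wording changes) hurt diversity just as much. A minimal sketch using the standard library's difflib; the 0.9 threshold is an assumption to tune for your data:

```python
from difflib import SequenceMatcher
from itertools import combinations

def find_near_duplicates(outputs, threshold=0.9):
    """Return (i, j, ratio) for output pairs more similar than threshold.

    All-pairs comparison is O(n^2) -- fine up to a few thousand examples;
    sample the dataset first if yours is larger.
    """
    pairs = []
    for i, j in combinations(range(len(outputs)), 2):
        ratio = SequenceMatcher(None, outputs[i], outputs[j]).ratio()
        if ratio >= threshold:
            pairs.append((i, j, round(ratio, 3)))
    return pairs

outputs = [
    "Issue: Delayed delivery. Priority: High.",
    "Issue: Delayed delivery. Priority: High!",   # near-duplicate
    "Issue: PDF export crash. Priority: Medium.",
]
print(find_near_duplicates(outputs))  # [(0, 1, 0.975)]
```

If more than a few percent of pairs trip the threshold, prune before training rather than after.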
LoRA vs QLoRA vs Full Fine-Tuning {#training-methods}
Full Fine-Tuning
Updates every weight in the model. Produces the best results but requires enormous resources.
| Model Size | VRAM Needed | Training Time | When to Use |
|---|---|---|---|
| 1B | ~12GB | ~1 hour | Almost never locally |
| 7B | ~60GB | ~6 hours | Multi-GPU setups only |
| 13B | ~120GB | ~12 hours | Server/cloud only |
Verdict: Not practical on consumer hardware for models above 1B.
LoRA (Low-Rank Adaptation)
Freezes the base model and trains small adapter matrices that modify the model's behavior. Dramatically reduces memory and training time.
| Model Size | VRAM Needed | Adapter Size | Quality vs Full |
|---|---|---|---|
| 7B | ~14GB | 20-100MB | 95-98% |
| 13B | ~28GB | 30-150MB | 95-98% |
How LoRA works: Instead of updating a weight matrix W (e.g., 4096x4096 = 16M parameters), LoRA decomposes the update into two small matrices A and B where W' = W + AB. With rank r=16, that's only 4096*16 + 16*4096 = 131K parameters. You train 131K values instead of 16M.
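That parameter arithmetic is easy to verify in a few lines:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in a rank-r LoRA adapter for one d_in x d_out matrix."""
    # A is d_in x r, B is r x d_out; the update is W' = W + AB
    return d_in * r + r * d_out

full = 4096 * 4096                       # updating W directly: 16,777,216 params
adapter = lora_params(4096, 4096, r=16)  # 131,072 params
print(f"{adapter:,} vs {full:,} ({full // adapter}x fewer)")
# 131,072 vs 16,777,216 (128x fewer)
```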
QLoRA (Quantized LoRA)
Same concept as LoRA, but the base model is loaded in 4-bit quantization. This is the breakthrough that makes consumer-GPU fine-tuning possible.
| Model Size | VRAM Needed | Quality vs Full | Speed |
|---|---|---|---|
| 3B | ~4GB | 93-96% | Fast |
| 7B | ~6GB | 93-96% | Moderate |
| 13B | ~12GB | 93-96% | Slow |
| 32B | ~24GB | 93-96% | Very slow |
Verdict: QLoRA is the standard for local fine-tuning. The 3-5% quality gap versus full fine-tuning is rarely noticeable in practice.
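A back-of-envelope way to sanity-check the VRAM figures above: 4-bit weights cost about 0.5 bytes per parameter, and training adds activation and optimizer overhead on top. The ~1.5x overhead factor and ~1.5 GB fixed cost below are assumptions for rough planning, not measurements:

```python
def qlora_vram_gb(params_billion: float) -> float:
    """Rough QLoRA VRAM estimate: 4-bit weights (~0.5 bytes/param)
    plus ~50% training overhead plus ~1.5 GB fixed framework cost.
    Assumed factors -- calibrate against your own runs."""
    return params_billion * 0.5 * 1.5 + 1.5

for size in (3, 7, 13, 32):
    print(f"{size}B: ~{qlora_vram_gb(size):.1f} GB")
```

The estimates land within a couple of GB of the table; actual usage also depends on sequence length, batch size, and gradient checkpointing.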
Hardware Requirements {#hardware-requirements}
Minimum Specs for QLoRA Fine-Tuning
| Model Size | Min VRAM | Min RAM | Storage | Time (500 steps) |
|---|---|---|---|---|
| 3B | 4GB | 8GB | 10GB | 15-25 min |
| 7-8B | 6GB | 16GB | 20GB | 30-60 min |
| 13-14B | 10GB | 16GB | 40GB | 1-2 hours |
| 32B | 22GB | 32GB | 80GB | 3-6 hours |
| 70B | 44GB | 64GB | 160GB | 8-16 hours |
Tested GPU Performance
| GPU | VRAM | Max Model (QLoRA) | 7B Training Speed |
|---|---|---|---|
| RTX 3060 | 12GB | 13B | 3.2 samples/sec |
| RTX 4060 Ti | 8GB | 7B | 4.1 samples/sec |
| RTX 4070 | 12GB | 13B | 5.8 samples/sec |
| RTX 4090 | 24GB | 32B | 9.4 samples/sec |
| RTX 5090 | 32GB | 70B (tight) | 11.2 samples/sec |
Apple Silicon
Apple Silicon can fine-tune via MLX, but it's significantly slower than CUDA:
| Mac | RAM | Max Model | 7B Speed |
|---|---|---|---|
| M1 16GB | 16GB | 7B | 0.8 samples/sec |
| M3 Pro 18GB | 18GB | 7B | 1.4 samples/sec |
| M3 Max 36GB | 36GB | 14B | 1.8 samples/sec |
| M4 Pro 24GB | 24GB | 13B | 1.6 samples/sec |
An RTX 3060 trains 4x faster than an M3 Pro for the same model. NVIDIA is the clear choice for fine-tuning specifically. For inference-only workloads, Apple Silicon is more competitive. See our RAM requirements guide for inference-focused hardware sizing.
Unsloth Setup {#unsloth-setup}
Unsloth is the fastest open-source fine-tuning framework. It provides 2x training speed over standard HuggingFace, automatic memory optimization, and pre-quantized model loading.
Installation
# Create a clean environment
conda create -n finetune python=3.11
conda activate finetune
# Install PyTorch (CUDA)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install Unsloth
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
# Verify CUDA
python -c "import torch; print(torch.cuda.is_available())"
# Should print: True
Verify Setup
from unsloth import FastLanguageModel
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
Step-by-Step Training {#training-pipeline}
Here's a complete, working fine-tuning pipeline. This example fine-tunes Llama 3.2 3B on a custom instruction dataset.
Step 1: Load the Base Model
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
max_seq_length=2048,
dtype=None, # Auto-detect
load_in_4bit=True, # QLoRA
)
This loads the 3B model in 4-bit precision, consuming about 2.5GB VRAM. Unsloth provides pre-quantized checkpoints for all popular models.
Step 2: Add LoRA Adapters
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank (8-64, higher = more capacity)
target_modules=[ # Which layers to adapt
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
],
lora_alpha=16, # Scaling factor (typically = r)
lora_dropout=0, # 0 for Unsloth (handled differently)
bias="none",
use_gradient_checkpointing="unsloth", # 60% less VRAM
random_state=42,
)
# Check trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 41,943,040 || all params: 3,254,779,904 || trainable%: 1.29
Only 1.29% of parameters are trainable. This is why QLoRA is so memory-efficient.
Step 3: Prepare the Dataset
from datasets import load_dataset
# Load from local file
dataset = load_dataset("json", data_files="my_training_data.jsonl", split="train")
# Format into chat template
def format_example(example):
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": example["instruction"] + ("\n" + example["input"] if example.get("input") else "")},
{"role": "assistant", "content": example["output"]}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
return {"text": text}
dataset = dataset.map(format_example)
print(f"Training examples: {len(dataset)}")
print(f"Sample:\n{dataset[0]['text'][:500]}")
Step 4: Configure Training
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=2048,
dataset_num_proc=2,
packing=True, # Pack multiple examples into one sequence
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4, # Effective batch size: 2 * 4 = 8
warmup_steps=5,
max_steps=300, # Adjust based on dataset size
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
logging_steps=10,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
seed=42,
output_dir="outputs",
save_steps=100,
),
)
Step 5: Train
trainer_stats = trainer.train()
print(f"Training time: {trainer_stats.metrics['train_runtime']:.0f} seconds")
print(f"Samples/second: {trainer_stats.metrics['train_samples_per_second']:.1f}")
print(f"Final loss: {trainer_stats.metrics['train_loss']:.4f}")
Expected output on an RTX 3060 with 300 steps:
Training time: 420 seconds
Samples/second: 3.2
Final loss: 0.8124
Step 6: Test Before Exporting
# Switch to inference mode
FastLanguageModel.for_inference(model)
# Test with a prompt from your domain
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Summarize this customer complaint: I've been waiting 3 weeks for my refund and nobody has responded to my emails."}
]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Evaluating Your Model {#evaluation}
Training loss going down doesn't mean your model is good. You need real evaluation.
Manual Evaluation (Most Important)
Create a test set of 20-50 examples the model has never seen. Run each through the model and score the outputs:
import json
test_examples = []
with open("test_data.jsonl") as f:
for line in f:
test_examples.append(json.loads(line))
results = []
for ex in test_examples:
messages = [
{"role": "user", "content": ex["instruction"] + "\n" + ex.get("input", "")}
]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(input_ids=inputs, max_new_tokens=512, temperature=0.3)
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
results.append({
"input": ex["instruction"],
"expected": ex["output"],
"generated": generated,
})
# Save for manual review
with open("evaluation_results.json", "w") as f:
json.dump(results, f, indent=2)
print(f"Saved {len(results)} evaluation results for manual review")
Automated Metrics
# pip install rouge-score
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = []
for r in results:
score = scorer.score(r["expected"], r["generated"])
scores.append(score['rougeL'].fmeasure)
avg_rouge = sum(scores) / len(scores)
print(f"Average ROUGE-L: {avg_rouge:.3f}")
# Good: > 0.5 for summarization
# Good: > 0.7 for structured output
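ROUGE compares raw strings, which is a blunt instrument when the target is structured. If your outputs are JSON, parsing both sides and comparing fields is a stricter check. A minimal sketch, assuming flat JSON objects:

```python
import json

def field_accuracy(expected_json: str, generated_json: str) -> float:
    """Fraction of expected fields reproduced exactly in the generated JSON."""
    try:
        expected = json.loads(expected_json)
        generated = json.loads(generated_json)
    except json.JSONDecodeError:
        return 0.0  # unparseable output is a total miss
    if not isinstance(expected, dict) or not isinstance(generated, dict):
        return 0.0
    hits = sum(1 for key, value in expected.items() if generated.get(key) == value)
    return hits / len(expected) if expected else 0.0

score = field_accuracy(
    '{"issue": "delayed delivery", "priority": "High"}',
    '{"issue": "delayed delivery", "priority": "Medium"}',
)
print(score)  # 0.5
```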
A/B Testing Against Base Model
The real question is: "Is my fine-tuned model better than the base model with a good prompt?" Always test this:
# Load base model (without fine-tuning) for comparison
base_model, base_tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
max_seq_length=2048,
load_in_4bit=True,
)
# Run same test set through base model with your best system prompt
# Compare outputs side by side
If the fine-tuned model isn't clearly better than the base model + prompt engineering, don't deploy it. Iterate on your training data instead.
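One way to tabulate the side-by-side comparison: score each model's output against the expected answer and count wins. Similarity via difflib is a stand-in here for human judgment, which is the better criterion when you can afford it:

```python
from difflib import SequenceMatcher

def ab_tally(expected, base_outputs, tuned_outputs):
    """Count per-example wins using similarity to the expected output."""
    wins = {"base": 0, "tuned": 0, "tie": 0}
    for exp, base, tuned in zip(expected, base_outputs, tuned_outputs):
        s_base = SequenceMatcher(None, exp, base).ratio()
        s_tuned = SequenceMatcher(None, exp, tuned).ratio()
        if abs(s_base - s_tuned) < 0.02:   # too close to call
            wins["tie"] += 1
        elif s_tuned > s_base:
            wins["tuned"] += 1
        else:
            wins["base"] += 1
    return wins

print(ab_tally(
    ["Priority: High"],
    ["The priority seems high overall."],  # base model + prompt
    ["Priority: High"],                    # fine-tuned model
))  # {'base': 0, 'tuned': 1, 'tie': 0}
```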
Deploying with Ollama {#deployment}
Step 1: Export to GGUF
# Save merged model (LoRA weights merged into base)
model.save_pretrained_merged(
"finetuned-model",
tokenizer,
save_method="merged_16bit",
)
# Or save as separate LoRA adapter (smaller, portable)
model.save_pretrained("finetuned-lora")
Step 2: Convert to GGUF
# Clone llama.cpp if you haven't
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt
# Build the quantize tool
cmake -B build && cmake --build build --config Release
# Convert to GGUF at f16 (convert_hf_to_gguf.py does not emit q4_K_M directly)
python convert_hf_to_gguf.py ../finetuned-model \
  --outtype f16 \
  --outfile finetuned-model-f16.gguf
# Quantize to 4-bit
./build/bin/llama-quantize finetuned-model-f16.gguf finetuned-model-q4.gguf Q4_K_M
Step 3: Create Ollama Model
# Write the Modelfile
cat > Modelfile << 'EOF'
FROM ./finetuned-model-q4.gguf
TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
PARAMETER stop "<|eot_id|>"
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant specialized in customer support ticket analysis."
EOF
# Create the model
ollama create my-custom-model -f Modelfile
# Test it
ollama run my-custom-model "Summarize: The product broke after 2 days and I want a replacement."
Step 4: Verify Deployment
# Check model is listed
ollama list | grep my-custom-model
# Test via API
curl http://localhost:11434/api/generate -d '{
"model": "my-custom-model",
"prompt": "Summarize this ticket: Customer reports login failures since the update.",
"stream": false
}'
Your fine-tuned model is now available through Ollama's standard API, compatible with any Ollama client including Open WebUI.
Cost Analysis {#cost-analysis}
Local Fine-Tuning Costs
| Component | One-Time Cost | Per-Training-Run |
|---|---|---|
| GPU (RTX 3060 used) | $180-220 | $0 |
| Electricity (1 hour) | -- | $0.15-0.30 |
| Software | $0 | $0 |
| Total (first run) | ~$200 | ~$200 |
| Total (subsequent) | $0 | ~$0.25 |
Cloud Fine-Tuning Comparison
| Service | 7B Model, 500 Steps | Per Run |
|---|---|---|
| OpenAI fine-tuning | $15-50 (depending on dataset) | $15-50 |
| Together AI | $5-20 | $5-20 |
| Google Colab Pro | $10/month (limited hours) | ~$3-5 |
| Local (after GPU purchase) | Electricity only | ~$0.25 |
After 10-15 fine-tuning runs, the local GPU has paid for itself versus cloud alternatives. If you iterate frequently on training data (which you should), local is dramatically cheaper.
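The break-even claim follows directly from the table's numbers:

```python
gpu_cost = 200.0        # used RTX 3060, midpoint of the range above
local_per_run = 0.25    # electricity per run
cloud_per_run = 15.0    # low end of the OpenAI fine-tuning estimate

# Each local run saves (cloud - local); divide that into the GPU cost
breakeven_runs = gpu_cost / (cloud_per_run - local_per_run)
print(f"{breakeven_runs:.1f} runs")  # 13.6 runs
```

At the cheaper Together AI rate (~$5/run) the break-even stretches to roughly 42 runs, so the payoff depends on how often you iterate.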
Time Investment
| Phase | First Time | Subsequent Runs |
|---|---|---|
| Environment setup | 30-60 min | 0 min (already done) |
| Data preparation | 2-8 hours | 30 min (iterating) |
| Training (300 steps, 7B) | 30-60 min | 30-60 min |
| Evaluation | 1-2 hours | 30 min |
| Deployment | 30 min | 5 min |
Common Pitfalls {#pitfalls}
1. Training on AI-Generated Data Without Review
If you use GPT-4 or Claude to generate your training data, you're teaching your local model to mimic another AI -- including its errors, hallucinations, and biases. Always have a human review and correct AI-generated training examples.
2. Too Many Training Steps (Overfitting)
Symptom: Model gives perfect answers for training-like inputs but gibberish or repetitive outputs for novel inputs.
Fix: Reduce max_steps. A good starting point is (dataset_size / batch_size) * 3, i.e. about three epochs. Monitor training loss -- if it drops below 0.1, you're almost certainly overfitting.
# Good heuristic for max_steps
num_examples = 500
batch_size = 8
epochs = 3
max_steps = (num_examples // batch_size) * epochs # 186 steps
3. Too Little Data
50 examples is the absolute minimum for learning basic format changes. For domain adaptation, you need 200+. For complex multi-task behavior, 1,000+. If you have fewer than 100 examples, spend more time on data collection before training.
4. Inconsistent Training Data
If half your examples use formal tone and half use casual tone, the model will randomly switch between them. Audit your dataset for consistency in:
- Output format (JSON, prose, bullet points)
- Tone (formal, casual, technical)
- Length (all short, all long, or intentionally varied)
- Level of detail
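Some of these checks can be automated. A minimal audit sketch that flags mixed output formats and reports length spread; the format heuristics and the 90% dominance threshold are assumptions to adjust for your task:

```python
import json
import statistics

def audit_outputs(outputs):
    """Tally output formats and measure length spread across a dataset."""
    formats = {"json": 0, "bullets": 0, "prose": 0}
    for out in outputs:
        s = out.strip()
        if s.startswith(("{", "[")):
            formats["json"] += 1
        elif s.startswith(("-", "*", "•")):
            formats["bullets"] += 1
        else:
            formats["prose"] += 1
    lengths = [len(o) for o in outputs]
    dominant = max(formats.values()) / max(len(outputs), 1)
    spread = statistics.pstdev(lengths) / max(statistics.mean(lengths), 1)
    return {"formats": formats, "dominant_share": dominant, "length_spread": spread}

# For a real dataset:
# outputs = [json.loads(line)["output"] for line in open("my_training_data.jsonl")]
report = audit_outputs([
    '{"issue": "delay", "priority": "High"}',
    '{"issue": "crash", "priority": "Medium"}',
    "The customer is unhappy about a delayed order.",  # odd one out
])
print(report["formats"])  # {'json': 2, 'bullets': 0, 'prose': 1}
if report["dominant_share"] < 0.9:
    print("Warning: mixed output formats -- the model may switch between them")
```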
5. Wrong Base Model
Starting from the wrong base model wastes capacity. Guidelines:
- English general tasks: Llama 3.2 3B/8B
- Multilingual: Qwen 2.5 7B/14B
- Code: Qwen2.5-Coder or CodeLlama
- Reasoning-heavy: DeepSeek R1 distill variants
- Constrained hardware: Gemma 3 1B or Phi-4 Mini
6. Skipping Evaluation
"The loss went down so it must be working" is a trap. Loss measures how well the model predicts the next token on training data. It does not measure whether the model is useful for your actual task. Always run evaluation on a held-out test set.
7. LoRA Rank Too Low
With r=4 (very low rank), the adapter doesn't have enough capacity to learn complex behaviors. Start with r=16 for most tasks. Increase to 32 or 64 if you have enough VRAM and the model isn't learning the task. r=8 is fine for simple format changes.
Conclusion
Fine-tuning a model on your own data is no longer a research exercise. With QLoRA and Unsloth, the entire process -- from raw data to a deployed Ollama model -- takes 2-4 hours on a single consumer GPU.
The hard part isn't the code. It's the data. Spend 80% of your effort on data quality and 20% on training configuration. Start with 200 carefully reviewed examples, train for 300 steps, evaluate honestly, and iterate. Three rounds of data improvement plus retraining will outperform one round with 10x more data.
If you're just starting with local AI and haven't picked your hardware yet, our RAM requirements guide will help you size a machine that handles both inference and training. For the foundational LoRA concepts, see our LoRA fine-tuning guide.
The complete Unsloth documentation and model zoo are available at the Unsloth GitHub repository. Pre-quantized base models for common architectures are listed at HuggingFace's training documentation.
Ready to fine-tune but not sure which base model to start from? Our best local AI models for 8GB RAM guide ranks every major family by capability and memory usage.