LoRA Fine-Tuning Local Guide: Train Custom AI Models
LoRA Training Quick Start
3-Step Process:
1. pip install unsloth
2. Prepare dataset (JSONL format)
3. Run training script (~1 hour on RTX 4090)
What is LoRA?
LoRA (Low-Rank Adaptation) fine-tunes large models efficiently by:
- Training small adapter matrices (~0.1-1% of parameters)
- Keeping base model frozen
- Achieving 90%+ of full fine-tuning quality
- Using 10x less memory
Full Fine-tuning: Train all 7B parameters → 28GB+ VRAM
LoRA Fine-tuning: Train ~50M parameters → 8GB VRAM
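The parameter savings are easy to verify with back-of-the-envelope arithmetic. Here is a minimal sketch, assuming Llama-7B's standard shapes (hidden size 4096, MLP size 11008, 32 layers) and the seven target modules from the training script below; the exact total depends on rank and which modules you adapt:

```python
# Count LoRA adapter parameters for a Llama-7B-style model.
# Each adapted weight W (d_out x d_in) gains two low-rank matrices:
# A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) extra params.
hidden, mlp, layers, r = 4096, 11008, 32, 16

def lora_params(d_in, d_out, rank):
    return rank * (d_in + d_out)

per_layer = (
    4 * lora_params(hidden, hidden, r)   # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(hidden, mlp, r)    # gate_proj, up_proj
    + 1 * lora_params(mlp, hidden, r)    # down_proj
)
total = per_layer * layers
print(f"{total / 1e6:.1f}M trainable params")  # roughly 40M at r=16
```

At rank 16 this lands around 40M trainable parameters, well under 1% of the base model; pushing the rank to 32 roughly doubles it.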
VRAM Requirements
| Model Size | Full Fine-tune | LoRA | QLoRA |
|---|---|---|---|
| 7B | 28GB | 16GB | 8GB |
| 13B | 52GB | 24GB | 12GB |
| 32B | 128GB | 48GB | 24GB |
| 70B | 280GB | 96GB | 48GB |
QLoRA makes consumer GPU training practical.
Setting Up Unsloth (Fastest Method)
Installation
```bash
pip install unsloth
pip install --upgrade transformers datasets
```
Basic Training Script
```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Load dataset
dataset = load_dataset("json", data_files="training_data.jsonl")

# Collapse instruction/input/output into the single text field SFTTrainer expects
def format_example(example):
    prompt = f"### Instruction:\n{example['instruction']}\n"
    if example["input"]:
        prompt += f"### Input:\n{example['input']}\n"
    prompt += f"### Response:\n{example['output']}"
    return {"text": prompt + tokenizer.eos_token}

dataset = dataset.map(format_example)

# Training arguments
training_args = TrainingArguments(
    output_dir="./lora_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=10,
    max_steps=100,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_steps=50,
)

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    args=training_args,
    max_seq_length=2048,
)
trainer.train()

# Save the LoRA adapters (not the full model)
model.save_pretrained("./my_lora")
```
Preparing Your Dataset
Dataset Format (JSONL)
{"instruction": "Write a poem about AI", "input": "", "output": "Silicon dreams..."}
{"instruction": "Explain quantum computing", "input": "", "output": "Quantum computing uses..."}
{"instruction": "Translate to French", "input": "Hello world", "output": "Bonjour le monde"}
Conversation Format
{"conversations": [
{"role": "user", "content": "What is AI?"},
{"role": "assistant", "content": "AI is..."}
]}
Dataset Tips
- Quality > Quantity: 500 excellent examples > 5000 mediocre
- Diversity: Cover all variations of your use case
- Format consistency: Same structure throughout
- Length variety: Mix short and long responses
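The format-consistency tip is worth automating before you spend an hour of GPU time. Here is a minimal validation sketch, assuming the instruction-format JSONL shown above (adjust the filename and required keys for your own data):

```python
import json

REQUIRED = {"instruction", "input", "output"}

def validate_jsonl(path):
    """Report lines that are not valid JSON or don't match the expected keys."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # allow blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {i}: invalid JSON ({e})")
                continue
            if set(record) != REQUIRED:
                problems.append(f"line {i}: keys {sorted(record)} != {sorted(REQUIRED)}")
    return problems

# Example usage:
# issues = validate_jsonl("training_data.jsonl")
# print("OK" if not issues else "\n".join(issues))
```

A malformed line silently dropped or mis-parsed by the trainer is much harder to debug than a validation error printed up front.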
LoRA Hyperparameters
| Parameter | Recommended | Description |
|---|---|---|
| r (rank) | 8-32 | Higher = more capacity, more VRAM |
| alpha | 16-32 | Scaling factor, usually 2x rank |
| dropout | 0-0.1 | Regularization |
| target_modules | All attention + MLP | What to adapt |
| lr | 1e-4 to 2e-4 | Learning rate |
| epochs | 1-3 | Often 1 is enough |
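The alpha/rank relationship in the table comes from how the adapter is applied at inference: the low-rank update BA is scaled by alpha/r before being added to the frozen weight, so alpha = 2×rank gives the update a constant gain of 2 regardless of rank. A toy numpy sketch (arbitrary small shapes, not a real model):

```python
import numpy as np

d_out, d_in, r, alpha = 8, 8, 4, 8   # toy dimensions; alpha = 2 * r
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))        # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-init

def lora_linear(x):
    # Base output plus the scaled low-rank update
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapted layer matches the base exactly,
# so training starts from the pretrained behaviour and drifts from there.
assert np.allclose(lora_linear(x), W @ x)
```

This is also why doubling the rank without touching alpha effectively halves the update's scale, which can make a higher-rank run look mysteriously weaker.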
Merging and Deploying
Merge LoRA with Base Model
```python
# Merge LoRA adapters into the base model weights
merged_model = model.merge_and_unload()

# Save merged model and tokenizer together
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")
```
Convert to GGUF for Ollama
```bash
# Install llama.cpp (recent versions build with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Convert to GGUF (the conversion script replaced the old convert.py)
python convert_hf_to_gguf.py ../merged_model --outtype f16 --outfile model.gguf

# Quantize
./build/bin/llama-quantize model.gguf model-q4.gguf q4_k_m
```
Create Ollama Model
```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4.gguf
SYSTEM "You are a helpful assistant fine-tuned for specific tasks."
EOF

# Register with Ollama
ollama create my-custom-model -f Modelfile

# Run your model
ollama run my-custom-model
```
Training Time Estimates
| Model | Dataset | GPU | Time |
|---|---|---|---|
| 7B | 1K examples | RTX 4090 | 15 min |
| 7B | 10K examples | RTX 4090 | 2 hours |
| 14B | 1K examples | RTX 4090 | 30 min |
| 32B | 1K examples | RTX 4090 | 1 hour |
Unsloth is ~2x faster than standard training.
Common Use Cases
1. Custom Writing Style
Train on your existing content to match your voice.
2. Domain Expert
Train on medical, legal, or technical documents.
3. Company Knowledge
Train on internal docs for support chatbot.
4. Language Adaptation
Improve performance in specific languages.
5. Task Specialist
Optimize for specific tasks (summarization, extraction).
Troubleshooting
Out of Memory
- Reduce batch size to 1
- Lower LoRA rank (r=8)
- Use gradient checkpointing
- Reduce max_seq_length
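Lowering the batch size does not have to change training dynamics, because gradient accumulation preserves the effective batch size while shrinking peak activation memory. A quick sanity check on the numbers from the training script above:

```python
# Original settings from the training script
batch, accum = 2, 4
effective = batch * accum   # examples contributing to each optimizer step
print(effective)            # 8

# OOM fix: halve the per-device batch, double the accumulation steps.
# Activation memory roughly halves; the effective batch is unchanged.
batch, accum = 1, 8
assert batch * accum == effective
```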
Poor Results
- More/better training data
- Lower learning rate
- Train longer (more steps)
- Check data formatting
Overfitting
- Fewer epochs
- Add dropout (0.05-0.1)
- More diverse data
Key Takeaways
- LoRA trains 1% of parameters for 90% of results
- QLoRA enables training on consumer GPUs
- Unsloth is the fastest training framework
- Quality data matters more than quantity
- Deploy to Ollama for easy local use
- 8GB VRAM can train 7B models with QLoRA
Next Steps
- Set up Ollama to run your model
- Compare models to choose a base
- Build RAG with your fine-tuned model
LoRA democratizes AI customization: you can now create specialized AI models on consumer hardware that rival expensive cloud fine-tuning.