LoRA Fine-Tuning Local Guide: Train Custom AI Models
LoRA Training Quick Start
3-Step Process:
1. pip install unsloth
2. Prepare dataset (JSONL format)
3. Run training script (~1 hour on RTX 4090)
What is LoRA?
LoRA (Low-Rank Adaptation) fine-tunes large models efficiently by:
- Training small adapter matrices (~0.1-1% of parameters)
- Keeping base model frozen
- Achieving 90%+ of full fine-tuning quality
- Using 10x less memory
Full Fine-tuning: Train all 7B parameters → 28GB+ VRAM
LoRA Fine-tuning: Train ~50M parameters → 8GB VRAM
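The parameter savings are easy to verify with back-of-the-envelope arithmetic. Here is a minimal sketch, assuming Llama-7B's standard shapes (hidden size 4096, MLP size 11008, 32 layers) and the seven target modules from the training script below; the exact total depends on rank and which modules you adapt:

```python
# Count LoRA adapter parameters for a Llama-7B-style model.
# Each adapted weight W (d_out x d_in) gains two low-rank matrices:
# A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) extra params.
hidden, mlp, layers, r = 4096, 11008, 32, 16

def lora_params(d_in, d_out, rank):
    return rank * (d_in + d_out)

per_layer = (
    4 * lora_params(hidden, hidden, r)   # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(hidden, mlp, r)    # gate_proj, up_proj
    + 1 * lora_params(mlp, hidden, r)    # down_proj
)
total = per_layer * layers
print(f"{total / 1e6:.1f}M trainable params")  # roughly 40M at r=16
```

At rank 16 this lands around 40M trainable parameters, well under 1% of the base model; pushing the rank to 32 roughly doubles it.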
VRAM Requirements
| Model Size | Full Fine-tune | LoRA | QLoRA |
|---|---|---|---|
| 7B | 28GB | 16GB | 8GB |
| 13B | 52GB | 24GB | 12GB |
| 32B | 128GB | 48GB | 24GB |
| 70B | 280GB | 96GB | 48GB |
QLoRA makes consumer GPU training practical.
Setting Up Unsloth (Fastest Method)
Installation
```bash
pip install unsloth
pip install --upgrade transformers datasets
```
Basic Training Script
```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Load dataset
dataset = load_dataset("json", data_files="training_data.jsonl")

# Collapse instruction/input/output into the single text field SFTTrainer expects
def format_example(example):
    prompt = f"### Instruction:\n{example['instruction']}\n"
    if example["input"]:
        prompt += f"### Input:\n{example['input']}\n"
    prompt += f"### Response:\n{example['output']}"
    return {"text": prompt + tokenizer.eos_token}

dataset = dataset.map(format_example)

# Training arguments
training_args = TrainingArguments(
    output_dir="./lora_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=10,
    max_steps=100,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_steps=50,
)

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    args=training_args,
    max_seq_length=2048,
)
trainer.train()

# Save the LoRA adapters (not the full model)
model.save_pretrained("./my_lora")
```
Preparing Your Dataset
Dataset Format (JSONL)
{"instruction": "Write a poem about AI", "input": "", "output": "Silicon dreams..."}
{"instruction": "Explain quantum computing", "input": "", "output": "Quantum computing uses..."}
{"instruction": "Translate to French", "input": "Hello world", "output": "Bonjour le monde"}
Conversation Format
{"conversations": [
{"role": "user", "content": "What is AI?"},
{"role": "assistant", "content": "AI is..."}
]}
Dataset Tips
- Quality > Quantity: 500 excellent examples > 5000 mediocre
- Diversity: Cover all variations of your use case
- Format consistency: Same structure throughout
- Length variety: Mix short and long responses
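The format-consistency tip is worth automating before you spend an hour of GPU time. Here is a minimal validation sketch, assuming the instruction-format JSONL shown above (adjust the filename and required keys for your own data):

```python
import json

REQUIRED = {"instruction", "input", "output"}

def validate_jsonl(path):
    """Report lines that are not valid JSON or don't match the expected keys."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # allow blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {i}: invalid JSON ({e})")
                continue
            if set(record) != REQUIRED:
                problems.append(f"line {i}: keys {sorted(record)} != {sorted(REQUIRED)}")
    return problems

# Example usage:
# issues = validate_jsonl("training_data.jsonl")
# print("OK" if not issues else "\n".join(issues))
```

A malformed line silently dropped or mis-parsed by the trainer is much harder to debug than a validation error printed up front.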
LoRA Hyperparameters
| Parameter | Recommended | Description |
|---|---|---|
| r (rank) | 8-32 | Higher = more capacity, more VRAM |
| alpha | 16-32 | Scaling factor, usually 2x rank |
| dropout | 0-0.1 | Regularization |
| target_modules | All attention + MLP | What to adapt |
| lr | 1e-4 to 2e-4 | Learning rate |
| epochs | 1-3 | Often 1 is enough |
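The alpha/rank relationship in the table comes from how the adapter is applied at inference: the low-rank update BA is scaled by alpha/r before being added to the frozen weight, so alpha = 2×rank gives the update a constant gain of 2 regardless of rank. A toy numpy sketch (arbitrary small shapes, not a real model):

```python
import numpy as np

d_out, d_in, r, alpha = 8, 8, 4, 8   # toy dimensions; alpha = 2 * r
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))        # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-init

def lora_linear(x):
    # Base output plus the scaled low-rank update
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapted layer matches the base exactly,
# so training starts from the pretrained behaviour and drifts from there.
assert np.allclose(lora_linear(x), W @ x)
```

This is also why doubling the rank without touching alpha effectively halves the update's scale, which can make a higher-rank run look mysteriously weaker.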
Merging and Deploying
Merge LoRA with Base Model
```python
# Merge LoRA adapters into the base model weights
merged_model = model.merge_and_unload()

# Save merged model and tokenizer together
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")
```
Convert to GGUF for Ollama
```bash
# Install llama.cpp (recent versions build with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Convert to GGUF (the conversion script replaced the old convert.py)
python convert_hf_to_gguf.py ../merged_model --outtype f16 --outfile model.gguf

# Quantize
./build/bin/llama-quantize model.gguf model-q4.gguf q4_k_m
```
Create Ollama Model
```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4.gguf
SYSTEM "You are a helpful assistant fine-tuned for specific tasks."
EOF

# Register with Ollama
ollama create my-custom-model -f Modelfile

# Run your model
ollama run my-custom-model
```
Training Time Estimates
| Model | Dataset | GPU | Time |
|---|---|---|---|
| 7B | 1K examples | RTX 4090 | 15 min |
| 7B | 10K examples | RTX 4090 | 2 hours |
| 14B | 1K examples | RTX 4090 | 30 min |
| 32B | 1K examples | RTX 4090 | 1 hour |
Unsloth is ~2x faster than standard training.
Common Use Cases
1. Custom Writing Style
Train on your existing content to match your voice.
2. Domain Expert
Train on medical, legal, or technical documents.
3. Company Knowledge
Train on internal docs for support chatbot.
4. Language Adaptation
Improve performance in specific languages.
5. Task Specialist
Optimize for specific tasks (summarization, extraction).
Troubleshooting
Out of Memory
- Reduce batch size to 1
- Lower LoRA rank (r=8)
- Use gradient checkpointing
- Reduce max_seq_length
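Lowering the batch size does not have to change training dynamics, because gradient accumulation preserves the effective batch size while shrinking peak activation memory. A quick sanity check on the numbers from the training script above:

```python
# Original settings from the training script
batch, accum = 2, 4
effective = batch * accum   # examples contributing to each optimizer step
print(effective)            # 8

# OOM fix: halve the per-device batch, double the accumulation steps.
# Activation memory roughly halves; the effective batch is unchanged.
batch, accum = 1, 8
assert batch * accum == effective
```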
Poor Results
- More/better training data
- Lower learning rate
- Train longer (more steps)
- Check data formatting
Overfitting
- Fewer epochs
- Add dropout (0.05-0.1)
- More diverse data
Key Takeaways
- LoRA trains 1% of parameters for 90% of results
- QLoRA enables training on consumer GPUs
- Unsloth is the fastest training framework
- Quality data matters more than quantity
- Deploy to Ollama for easy local use
- 8GB VRAM can train 7B models with QLoRA
Next Steps
- Set up Ollama to run your model
- Compare models to choose a base
- Build RAG with your fine-tuned model
LoRA democratizes AI customization: you can now create specialized AI models on consumer hardware that rival expensive cloud fine-tuning.