QLoRA Fine-Tuning Complete Guide (2026): Train 70B Models on a Single 24GB GPU
QLoRA is the technique that made fine-tuning 70B models on consumer GPUs practical. Quantize the frozen base to 4-bit, train LoRA adapters in BF16 on top. The result: Llama 3.1 70B fine-tunes on a single RTX 4090 with a paged optimizer and offloading. Quality typically matches LoRA at a fraction of full fine-tuning's memory (~40 GB vs 280+ GB for 70B). Pair with Unsloth or Axolotl for production-grade training pipelines.
This guide covers everything: how QLoRA works mathematically, hardware requirements per model size, dataset preparation, the Unsloth / Axolotl / TRL ecosystem, hyperparameter tuning, common training failures, and merging adapters for deployment.
Table of Contents
- What QLoRA Is
- QLoRA vs LoRA vs Full Fine-Tuning
- Hardware Requirements by Model Size
- Choosing a Framework: Unsloth, Axolotl, TRL
- Dataset Preparation
- Unsloth Walkthrough
- Axolotl Walkthrough
- Hyperparameter Tuning
- Training Loop Monitoring
- Merging LoRA Adapter
- Multi-LoRA Serving (vLLM)
- Multi-GPU QLoRA
- Vision-Language QLoRA
- Variants: LoRA-FA, GaLore, DoRA
- Common Failures
What QLoRA Is {#what-it-is}
Standard fine-tuning updates all weights, which means carrying roughly 4x the weight memory once Adam's state is counted (weights + gradients + two moment buffers). For a 70B model that is 280+ GB.
LoRA: freeze the base, train small low-rank adapter matrices A·B (rank r ≪ d) on top. Memory: roughly base size plus a small adapter overhead — about 80 GB for a 70B base plus adapters.
QLoRA (Dettmers et al., 2023): quantize the frozen base to 4-bit NF4 and train the LoRA adapters in BF16. Memory: roughly 40 GB for a 70B base plus adapters; with a paged optimizer and CPU offloading it can be squeezed onto a single 24 GB card.
Quality: typically within 1% of LoRA, within 2-3% of full fine-tuning.
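Under the hood this is a few lines with Hugging Face transformers + peft — the same recipe the frameworks below wrap. A minimal sketch; the model name and LoRA hyperparameters are illustrative, not prescriptive:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Frozen base in 4-bit NF4; all trainable math happens in BF16
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 from the QLoRA paper
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)

# Attach BF16 LoRA adapters; only these receive gradients
model = get_peft_model(base, LoraConfig(
    r=32, lora_alpha=64, lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))
model.print_trainable_parameters()  # typically well under 1% of total params
```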
QLoRA vs LoRA vs Full Fine-Tuning {#vs-others}
| Method | VRAM (70B) | Quality | Speed |
|---|---|---|---|
| Full fine-tuning | 280+ GB | Best | Slowest |
| LoRA | 80 GB | -1% | Fastest |
| QLoRA | 40 GB | -2% | Slower than LoRA (dequant overhead) |
| LoRA-FA (on QLoRA) | 38 GB | -2% | Similar to QLoRA |
| DoRA | 42 GB | -1.5% | Slightly slower |
For 95% of practical fine-tuning use cases (instruction tuning, domain adaptation, style adaptation): QLoRA is the right choice.
Hardware Requirements by Model Size {#hardware}
| Model | QLoRA VRAM | Time (1K examples, 3 epochs) |
|---|---|---|
| 1B - 3B | 6-8 GB | 30 min - 1 hr (RTX 4060) |
| 7B - 8B | 10-14 GB | 1-3 hrs (RTX 4070) |
| 13B - 14B | 16-20 GB | 3-6 hrs (RTX 4080) |
| 32B | 22-26 GB | 6-12 hrs (RTX 4090, tight) |
| 70B | 38-44 GB | 24-36 hrs (RTX 4090 + paged optim, or 2x 3090) |
Budget roughly 30% extra memory for context lengths above 2048, and pre-tokenize the dataset to see your real sequence-length distribution before committing to a VRAM budget (see the sketch below).
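A back-of-envelope check against this table; the coefficients are assumptions eyeballed from the rows above, not measurements:

```python
def qlora_vram_gb(params_b: float, seq_len: int = 2048) -> float:
    """Rough QLoRA VRAM estimate: NF4 weights (~0.55 GB per billion
    params, incl. quant constants) + ~5 GB for adapter, paged optimizer,
    and activations. Treat the result as a floor, not a guarantee."""
    base = params_b * 0.55 + 5.0
    return base * 1.3 if seq_len > 2048 else base  # the +30% rule above

print(f"{qlora_vram_gb(70):.0f} GB")  # ~44 GB, matching the 70B row
```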
Choosing a Framework: Unsloth, Axolotl, TRL {#frameworks}
| Framework | Best For | Speed |
|---|---|---|
| Unsloth | Single-GPU experimentation | 2x faster than vanilla |
| Axolotl | Production pipelines, multi-GPU | Standard |
| TRL (HF) | Fine-grained control | Standard |
| LLaMA-Factory | UI-based fine-tuning | Standard |
For most readers in 2026: Unsloth.
Dataset Preparation {#dataset}
ChatML JSONL (shown pretty-printed here; in the actual .jsonl file each record occupies a single line):

```json
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is local AI?"},
  {"role": "assistant", "content": "Local AI runs..."}
]}
```
Quality > volume:
- 500-5000 high-quality examples beats 50K low-quality
- Diverse coverage of edge cases your real workflow encounters
- Consistent formatting and style
- Clean of PII unless target domain requires
Tools for dataset creation: synthetic generation with a stronger model + manual review, scraping internal documentation, transcribing real customer interactions (with consent).
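Before training, a quick validation pass catches most formatting bugs. A minimal sketch (file name and checks are illustrative):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

with open("train.jsonl") as f:
    for i, line in enumerate(f, 1):
        msgs = json.loads(line)["messages"]  # raises on malformed JSON
        assert all(m["role"] in VALID_ROLES for m in msgs), f"bad role on line {i}"
        assert msgs[-1]["role"] == "assistant", f"nothing to learn from on line {i}"
        assert all(m["content"].strip() for m in msgs), f"empty content on line {i}"

print("dataset OK")
```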
Unsloth Walkthrough {#unsloth}
```bash
pip install unsloth
```

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load the 4-bit NF4 base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Add LoRA adapters on the attention and MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=32, lora_alpha=64, lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Load the ChatML JSONL and render each conversation with the chat template
dataset = load_dataset("json", data_files="train.jsonl", split="train")

def to_text(examples):
    return {"text": [
        tokenizer.apply_chat_template(msgs, tokenize=False)
        for msgs in examples["messages"]
    ]}

dataset = dataset.map(to_text, batched=True)

# Train
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch = 2 x 4 = 8
        warmup_ratio=0.03,
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
        output_dir="./qlora_output",
    ),
)
trainer.train()

# Save the adapter only (a few hundred MB, not the full model)
model.save_pretrained("my-lora-adapter")
tokenizer.save_pretrained("my-lora-adapter")
```
Time on an RTX 4090: roughly 1-3 hours for 1K examples over 3 epochs, in line with the hardware table above.
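Before merging, sanity-check the adapter with a quick generation. A sketch using Unsloth's inference mode, continuing from the code above (the prompt is illustrative):

```python
# Switch to Unsloth's fast inference mode (disables training hooks)
FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "What is local AI?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```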
Axolotl Walkthrough {#axolotl}
YAML config:

```yaml
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
load_in_4bit: true
adapter: qlora

lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

datasets:
  - path: ./train.jsonl
    type: chat_template

sequence_len: 4096
sample_packing: true
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-4
warmup_ratio: 0.03
optimizer: paged_adamw_8bit
bf16: auto
flash_attention: true
output_dir: ./qlora_output
```

```bash
accelerate launch -m axolotl.cli.train config.yml
```
For multi-GPU: run `accelerate config` once, then `accelerate launch --num_processes 2 -m axolotl.cli.train config.yml`.
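Two adjacent Axolotl entry points worth knowing; module paths and flags have shifted between releases, so verify against your installed version:

```bash
# Tokenize and cache the dataset once, before spending GPU hours
python -m axolotl.cli.preprocess config.yml

# Fold the trained adapter into the base weights after training
python -m axolotl.cli.merge_lora config.yml --lora_model_dir=./qlora_output
```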
Hyperparameter Tuning {#hyperparameters}
| Hyperparameter | Default | Range | Effect |
|---|---|---|---|
| Learning rate | 2e-4 | 1e-4 to 5e-4 | Too high = NaN; too low = no learning |
| Rank (r) | 32 | 8 to 128 | Higher = more capacity, more memory |
| Alpha | 2×r | r to 4×r | Scales adapter output (effective scale = alpha/r) |
| Dropout | 0 | 0 to 0.1 | Regularization for small datasets |
| Epochs | 3 | 1 to 5 | Too many = overfit on small data |
| Batch size | 1-4 | depends on VRAM | Higher = more stable gradients |
| Grad accumulation | 4-16 | depends on target effective batch | Effective batch = micro × accum |
For most domain adaptations: rank 32, alpha 64, LR 2e-4, 3 epochs is the right starting point.
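Two relationships from this table, made concrete (values taken from the walkthroughs above):

```python
# Effective batch size: what the optimizer actually sees per update step
micro_batch, grad_accum, num_gpus = 2, 4, 1
effective_batch = micro_batch * grad_accum * num_gpus  # = 8

# LoRA scaling: adapter output is multiplied by alpha / r, so the
# alpha = 2*r default keeps a constant scale of 2.0 as you sweep rank
r, lora_alpha = 32, 64
print(effective_batch, lora_alpha / r)  # 8 2.0
```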
Training Loop Monitoring {#monitoring}
Track:
- Train loss: should decrease smoothly. Sudden spike = LR too high or bad batch.
- Eval loss on held-out set: U-shape = overfit; monotonically decreasing = under-trained.
- Tokens/sec: should be stable; drops = data pipeline issue.
- GPU util: should be 80-95% during training; lower = bottleneck.
Use Weights & Biases or TensorBoard for dashboards. Enabling W&B is a single setting in both Unsloth and Axolotl, as shown below.
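A minimal sketch of that setting on the TRL/Unsloth side (run name is illustrative; requires `pip install wandb` and `wandb login`). In Axolotl, the equivalent is a `wandb_project:` line in config.yml.

```python
from transformers import TrainingArguments

# Same arguments as the Unsloth walkthrough, plus W&B reporting
args = TrainingArguments(
    output_dir="./qlora_output",
    report_to="wandb",       # stream loss/LR/throughput to Weights & Biases
    run_name="qlora-8b-v1",  # illustrative run name
)
```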
Merging LoRA Adapter {#merging}
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the full-precision base (BF16 keeps merge memory manageable)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)

# Attach the adapter, then fold A·B into the base weights
model = PeftModel.from_pretrained(base, "./my-lora-adapter")
merged = model.merge_and_unload()
merged.save_pretrained("./my-merged-model")

# Ship the tokenizer alongside the merged weights
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.save_pretrained("./my-merged-model")
```
The merged model is a standard Llama checkpoint runnable via Ollama / vLLM / llama.cpp without LoRA-specific support.
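If the deployment target is Ollama or llama.cpp, the usual next step is GGUF conversion. A sketch using llama.cpp's conversion tooling (run from a llama.cpp checkout; Q4_K_M is a common quant choice, adjust to taste):

```bash
# HF checkpoint -> F16 GGUF, then quantize for local inference
python convert_hf_to_gguf.py ./my-merged-model --outfile my-model-f16.gguf
./llama-quantize my-model-f16.gguf my-model-q4_k_m.gguf Q4_K_M
```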
Multi-LoRA Serving (vLLM) {#multi-lora}
For serving many fine-tunes from one base:
```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 8 --max-lora-rank 64 \
  --lora-modules legal=./loras/legal medical=./loras/medical
```
Per-request adapter selection works through the standard model field: set it to the adapter name registered via --lora-modules (e.g. "model": "legal").
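Through the OpenAI-compatible API that looks like this (base URL assumes vLLM's default port):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="legal",  # the adapter name from --lora-modules
    messages=[{"role": "user", "content": "Summarize this clause."}],
)
print(resp.choices[0].message.content)
```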
Multi-GPU QLoRA {#multi-gpu}
For 2-4 GPU setups:
```bash
# Axolotl with FSDP
accelerate config   # set num_processes, mixed_precision bf16
accelerate launch -m axolotl.cli.train config.yml
```
For 70B QLoRA on 2x RTX 3090 with NVLink: ~12-18 hours for 1K examples (vs 24-36 on single 4090). NVLink helps significantly for multi-GPU LoRA.
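For reference, the relevant part of what `accelerate config` writes out — a trimmed sketch of default_config.yaml; the exact FSDP sub-keys vary by Accelerate version:

```yaml
distributed_type: FSDP
mixed_precision: bf16
num_processes: 2
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
```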
Vision-Language QLoRA {#vlm}
For Qwen 2-VL 7B:
```yaml
# LLaMA-Factory config (trimmed)
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
finetuning_type: lora
lora_target: all
freeze_vision_tower: true
```
Standard pattern: freeze the vision encoder, train LoRA on the language model. 1-2K labeled image+text pairs typically yield a substantial accuracy improvement on domain tasks.
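The config above launches via LLaMA-Factory's CLI (the config file name is illustrative):

```bash
llamafactory-cli train qwen2vl_qlora.yaml
```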
Variants: LoRA-FA, GaLore, DoRA {#variants}
- LoRA-FA: freezes the A matrix (random init), trains only B. Halves trainable parameters at minor quality cost.
- GaLore (Gradient Low-Rank Projection): projects gradients into a low-rank subspace before applying. Approximates full fine-tuning at LoRA-like memory.
- DoRA (Weight-Decomposed LoRA): decomposes pretrained weights into magnitude × direction; trains direction with LoRA. ~1% better quality than QLoRA at same rank.
Unsloth and Axolotl support all four (QLoRA plus these three variants). The default in 2026 remains QLoRA; DoRA is worth experimenting with for quality-sensitive tasks.
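Trying DoRA is a one-flag change in PEFT: the same LoraConfig as earlier plus use_dora. A minimal sketch:

```python
from peft import LoraConfig

# Identical to the QLoRA adapter config, with weight decomposition enabled
config = LoraConfig(
    r=32, lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,  # magnitude x direction decomposition (DoRA)
    task_type="CAUSAL_LM",
)
```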
Common Failures {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| OOM at start | Sequence length too high | Lower sequence_len or use sample packing |
| OOM mid-training | Memory leak / gradient checkpoint disabled | Enable gradient checkpointing |
| NaN loss | LR too high | Drop to 1e-4 |
| Loss plateaus | LR too low or rank too low | Increase one |
| Eval loss diverges from train | Overfitting on small dataset | Add dropout 0.05, fewer epochs |
| Garbage outputs after merge | Wrong target modules | Verify target_modules matches model architecture |
| Multi-GPU slower than single | NCCL config | Tune NCCL_P2P_LEVEL |
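For the two OOM rows, the fix in code looks like this on the Unsloth side — the "unsloth" checkpointing mode is its documented memory-optimized variant; other arguments match the walkthrough above:

```python
# Enable memory-optimized gradient checkpointing when adding adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=32, lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing="unsloth",  # recompute activations to cut VRAM
)
```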
Sources: QLoRA paper (Dettmers et al., 2023) | Unsloth GitHub | Axolotl GitHub | Hugging Face TRL | Internal benchmarks (RTX 4090, 2x RTX 3090).