Question 1

What is QLoRA and how does it differ from LoRA and full fine-tuning?

Accepted Answer

QLoRA = Quantized LoRA. Standard fine-tuning updates all model weights — needs 4-8x the model's VRAM in optimizer states. LoRA freezes the base and trains small low-rank adapter matrices — needs ~30% of full fine-tuning memory. QLoRA goes further: 4-bit quantize the frozen base, then train LoRA adapters in BF16/FP16 on top. The result: train Llama 3.1 70B on a single 24GB GPU. Quality typically matches LoRA fine-tuning (within 1% on benchmarks), with negligible loss vs full fine-tuning for most adaptation tasks.

Question 2

What hardware do I need to QLoRA-fine-tune Llama 3.1 70B?

Accepted Answer

Single RTX 4090 24 GB or 2x RTX 3090 with NVLink. The 4-bit quantized 70B base fits in ~40 GB; with LoRA adapters and optimizer state, total ~46-50 GB. On a single 24 GB card you need gradient accumulation + paged attention + sequence length 2048. Time: ~24-36 hours for 1K-example fine-tune. Speed up with 2x 24GB cards or H100. For Llama 3.1 8B QLoRA: comfortable on a single RTX 4070 16 GB in ~2-4 hours per 1K examples.

Question 3

Should I use Unsloth, Axolotl, or HuggingFace TRL for QLoRA?

Accepted Answer

Unsloth is fastest — custom Triton kernels deliver 2x speedup vs vanilla HF + bitsandbytes. Best single-GPU choice. Axolotl wraps TRL with YAML configs and supports more advanced patterns (multi-GPU FSDP, custom dataset formats, DPO/ORPO/KTO). Best for production fine-tuning pipelines. TRL (HuggingFace) is the underlying library — fine-grained control but more setup. For most users in 2026: start with Unsloth for single-GPU experimentation, graduate to Axolotl for production.

Question 4

What dataset format does QLoRA expect?

Accepted Answer

Most common: ChatML or instruction-format JSONL. Each example: `{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}`. For Alpaca-style: `{"instruction": "...", "input": "...", "output": "..."}`. Quantity: 500-5000 high-quality examples typically beats 50K low-quality ones. Quality matters more than volume for QLoRA. For domain adaptation, scrape 1000 representative examples from your real workflow rather than synthesizing many.

Question 5

What hyperparameters matter most?

Accepted Answer

Top three: (1) **learning rate** — 1e-4 to 3e-4 for QLoRA (higher than full fine-tuning); too high = unstable, too low = no adaptation. (2) **rank** — 16 for light task, 32 standard, 64 for heavy domain shift. (3) **alpha** — typically 2x rank. Less critical: dropout (0.05 default), warmup ratio (0.03), epochs (1-3). For most tasks: rank 32, alpha 64, LR 2e-4, 3 epochs is a strong default. For strong domain shifts (legal, medical), rank 64+.

Question 6

How do I merge a LoRA adapter back into the full model?

Accepted Answer

Use `peft.PeftModel.merge_and_unload()`. The adapter weights mathematically combine with base weights — the result is a full-precision Llama / Qwen / Mistral that runs in standard inference engines without LoRA-specific support. Trade-off: you lose the ability to swap adapters per request. For deployments serving multiple fine-tunes from one base: keep adapters separate and use multi-LoRA serving (vLLM `--enable-lora`). For single-purpose deployment: merge for slightly faster inference.

Question 7

Can I QLoRA-fine-tune for vision-language or multimodal models?

Accepted Answer

Yes — Qwen 2-VL, Llama 3.2 Vision, Pixtral all support QLoRA via Axolotl or LLaMA-Factory. Typical pattern: freeze the vision encoder + train LoRA on the language model only. For domain adaptation (medical imaging, legal documents), 1-2K labeled image+text pairs at QLoRA rank 32 substantially improves accuracy. Time: 6-12 hours on RTX 4090 for a 7B vision model.

Question 8

What's the difference between QLoRA, LoRA-FA, GaLore, and DoRA?

Accepted Answer

QLoRA: 4-bit base + standard LoRA adapters. LoRA-FA (LoRA Frozen-A): freezes one of the two LoRA matrices, halves trainable params. GaLore: gradient projection trick that approximates full fine-tuning at LoRA-like memory. DoRA (Weight-Decomposed LoRA): decomposes weights into magnitude + direction, trains direction with LoRA — slightly better quality at same rank. For most users in 2026: QLoRA is the proven default; DoRA is a worthwhile incremental upgrade; GaLore is research-interesting but rare in production. Unsloth and Axolotl support all four.

Method	VRAM (70B)	Quality	Speed
Full fine-tuning	280+ GB	Best	Slowest
LoRA	80 GB	-1%	Fast
QLoRA	40 GB	-2%	Faster
QLoRA-FA	38 GB	-2%	Faster
DoRA	42 GB	-1.5%	Slightly slower

Model	QLoRA VRAM	Time (1K examples, 3 epochs)
1B - 3B	6-8 GB	30 min - 1 hr (RTX 4060)
7B - 8B	10-14 GB	1-3 hrs (RTX 4070)
13B - 14B	16-20 GB	3-6 hrs (RTX 4080)
32B	22-26 GB	6-12 hrs (RTX 4090, tight)
70B	38-44 GB	24-36 hrs (RTX 4090 + paged optim, or 2x 3090)

Framework	Best For	Speed
Unsloth	Single-GPU experimentation	2x faster than vanilla
Axolotl	Production pipelines, multi-GPU	Standard
TRL (HF)	Fine-grained control	Standard
LLaMA-Factory	UI-based fine-tuning	Standard

Hyperparameter	Default	Range	Effect
Learning rate	2e-4	1e-4 to 5e-4	Too high = NaN; too low = no learning
Rank (r)	32	8 to 128	Higher = more capacity, more memory
Alpha	2×r	r to 4×r	Scaling factor
Dropout	0	0 to 0.1	Regularization for small datasets
Epochs	3	1 to 5	Too many = overfit on small data
Batch size	1-4	depends on VRAM	Higher = more stable gradients
Grad accumulation	4-16	depends	Effective batch = micro × accum

Symptom	Cause	Fix
OOM at start	Sequence length too high	Lower sequence_len or use sample packing
OOM mid-training	Memory leak / gradient checkpoint disabled	Enable gradient checkpointing
NaN loss	LR too high	Drop to 1e-4
Loss plateaus	LR too low or rank too low	Increase one
Eval loss diverges from train	Overfitting on small dataset	Add dropout 0.05, fewer epochs
Garbage outputs after merge	Wrong target modules	Verify target_modules matches model architecture
Multi-GPU slower than single	NCCL config	Tune NCCL_P2P_LEVEL

QLoRA Fine-Tuning Complete Guide (2026): Train 70B Models on a Single 24GB GPU

Want to go deeper than this article?

Table of Contents

Reading articles is good. Building is better.

What QLoRA Is {#what-it-is}

QLoRA vs LoRA vs Full Fine-Tuning {#vs-others}

Hardware Requirements by Model Size {#hardware}

Reading articles is good. Building is better.

Choosing a Framework: Unsloth, Axolotl, TRL {#frameworks}

Dataset Preparation {#dataset}

Unsloth Walkthrough {#unsloth}

Axolotl Walkthrough {#axolotl}

Hyperparameter Tuning {#hyperparameters}

Training Loop Monitoring {#monitoring}

Merging LoRA Adapter {#merging}

Multi-LoRA Serving (vLLM) {#multi-lora}

Multi-GPU QLoRA {#multi-gpu}

Vision-Language QLoRA {#vlm}

Variants: LoRA-FA, GaLore, DoRA {#variants}

Common Failures {#troubleshooting}

FAQ {#faq}

Go from reading about AI to building with AI

Liked this? 17 full AI courses are waiting.

LocalAimaster Research Team

Build Real AI on Your Machine

Want structured AI education?

Continue Your Local AI Journey

How to Install Your First Local AI Model

How to Choose the Right AI Model for Your Computer

Comments (0)

Ollama Docker Templates

Build Real AI on Your Machine

Related Guides

LoRA Fine-Tuning Local Guide

DPO / ORPO / KTO Guide

Train AI Model Own Data Locally

Fine Tune Local AI Business

Written by Pattanaik Ramswarup

Grab the AI Starter Kit — career roadmap, cheat sheet, setup guide

Go from reading about AI to building with AI