Training

QLoRA Fine-Tuning Complete Guide (2026): Train 70B Models on a Single 24GB GPU

May 1, 2026
18 min read
LocalAimaster Research Team

QLoRA is the technique that made fine-tuning 70B models on consumer GPUs practical: quantize the frozen base model to 4-bit, then train LoRA adapters in BF16 on top. The result: Llama 3.1 70B fine-tunes on a single RTX 4090. Quality typically matches LoRA at roughly 30% of full fine-tuning's memory. Pair it with Unsloth or Axolotl for production-grade training pipelines.

This guide covers everything: how QLoRA works mathematically, hardware requirements per model size, dataset preparation, the Unsloth / Axolotl / TRL ecosystem, hyperparameter tuning, common training failures, and merging adapters for deployment.

Table of Contents

  1. What QLoRA Is
  2. QLoRA vs LoRA vs Full Fine-Tuning
  3. Hardware Requirements by Model Size
  4. Choosing a Framework: Unsloth, Axolotl, TRL
  5. Dataset Preparation
  6. Unsloth Walkthrough
  7. Axolotl Walkthrough
  8. Hyperparameter Tuning
  9. Training Loop Monitoring
  10. Merging LoRA Adapter
  11. Multi-LoRA Serving (vLLM)
  12. Multi-GPU QLoRA
  13. Vision-Language QLoRA
  14. Variants: LoRA-FA, GaLore, DoRA
  15. Common Failures
  16. FAQ


What QLoRA Is {#what-it-is}

Standard fine-tuning updates all weights, which needs roughly 4x the model size in memory for optimizer state (Adam keeps weights, gradients, and two momentum terms per parameter). For a 70B model, that means 280+ GB minimum.

LoRA: freeze the base model and train small low-rank adapter matrices A·B (rank r ≪ d). Memory: roughly the base size plus a small adapter overhead — about 80 GB for a 70B model in BF16 with adapters.

QLoRA (Dettmers et al., 2023): quantize the frozen base to 4-bit NF4 and train the LoRA adapters in BF16. Memory: ~40 GB for a 70B base plus adapters, which a paged optimizer brings within reach of a single 24GB+ card.

Quality: typically within 1% of LoRA, within 2-3% of full fine-tuning.
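The ~40 GB figure falls out of simple bytes-per-parameter arithmetic. A rough sketch (the overhead figures are approximations, not measurements):

```python
def quantized_weight_gb(n_params: float, bits: int = 4) -> float:
    """Approximate frozen-weight footprint at a given bit width."""
    return n_params * bits / 8 / 1e9

# Llama 3.1 70B frozen base in NF4:
base_gb = quantized_weight_gb(70e9)   # 35.0 GB
# Rank-32 BF16 adapters add well under 1 GB, and quantization constants
# plus runtime overhead bring the practical total to roughly 40 GB.
print(base_gb)
```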


QLoRA vs LoRA vs Full Fine-Tuning {#vs-others}

| Method | VRAM (70B) | Quality | Speed |
|---|---|---|---|
| Full fine-tuning | 280+ GB | Best | Slowest |
| LoRA | 80 GB | -1% | Fast |
| QLoRA | 40 GB | -2% | Faster |
| QLoRA-FA | 38 GB | -2% | Faster |
| DoRA | 42 GB | -1.5% | Slightly slower |

For 95% of practical fine-tuning use cases (instruction tuning, domain adaptation, style adaptation): QLoRA is the right choice.


Hardware Requirements by Model Size {#hardware}

| Model | QLoRA VRAM | Time (1K examples, 3 epochs) |
|---|---|---|
| 1B - 3B | 6-8 GB | 30 min - 1 hr (RTX 4060) |
| 7B - 8B | 10-14 GB | 1-3 hrs (RTX 4070) |
| 13B - 14B | 16-20 GB | 3-6 hrs (RTX 4080) |
| 32B | 22-26 GB | 6-12 hrs (RTX 4090, tight) |
| 70B | 38-44 GB | 24-36 hrs (RTX 4090 + paged optim, or 2x 3090) |

Add ~30% memory headroom for context lengths above 2048; pre-tokenize the dataset to estimate VRAM needs before launching.



Choosing a Framework: Unsloth, Axolotl, TRL {#frameworks}

| Framework | Best For | Speed |
|---|---|---|
| Unsloth | Single-GPU experimentation | 2x faster than vanilla |
| Axolotl | Production pipelines, multi-GPU | Standard |
| TRL (HF) | Fine-grained control | Standard |
| LLaMA-Factory | UI-based fine-tuning | Standard |

For most readers in 2026: Unsloth.


Dataset Preparation {#dataset}

ChatML-style JSONL, one record per line (shown wrapped here for readability):

{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is local AI?"},
  {"role": "assistant", "content": "Local AI runs..."}
]}

Quality > volume:

  • 500-5,000 high-quality examples beat 50K low-quality ones
  • Diverse coverage of the edge cases your real workflow encounters
  • Consistent formatting and style
  • Scrubbed of PII unless the target domain requires it

Tools for dataset creation: synthetic generation with a stronger model + manual review, scraping internal documentation, transcribing real customer interactions (with consent).
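A quick sanity check over the JSONL before training saves a failed run. A minimal validator sketch, assuming the ChatML schema shown above (the function name and the specific checks are illustrative, not from any framework):

```python
import json

KNOWN_ROLES = {"system", "user", "assistant"}

def validate_chatml_line(line: str) -> list[str]:
    """Return a list of problems with one JSONL record (empty list = valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in KNOWN_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty content")
    if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
        problems.append("last message should be the assistant turn")
    return problems

# Usage over a file:
# problems = [p for line in open("train.jsonl") for p in validate_chatml_line(line)]
```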


Unsloth Walkthrough {#unsloth}

pip install unsloth

from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load 4-bit base
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=32, lora_alpha=64, lora_dropout=0,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
)

# Load dataset
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Train
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, train_dataset=dataset,
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_ratio=0.03,
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
        output_dir="./qlora_output",
    ),
)
trainer.train()

# Save adapter
model.save_pretrained("my-lora-adapter")

Time on RTX 4090: ~2-4 hours for 1K examples.
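To sanity-check the schedule implied by the arguments above, the step math works out as follows (assuming a 1K-example dataset and the batch settings from the snippet):

```python
import math

examples = 1000
micro_batch = 2        # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
epochs = 3
warmup_ratio = 0.03

effective_batch = micro_batch * grad_accum               # sequences per optimizer step
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch, total_steps, warmup_steps)  # 8 375 12
```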


Axolotl Walkthrough {#axolotl}

YAML config:

base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

datasets:
  - path: ./train.jsonl
    type: chat_template

sequence_len: 4096
sample_packing: true

micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-4
warmup_ratio: 0.03
optimizer: paged_adamw_8bit
bf16: auto
flash_attention: true
output_dir: ./qlora_output

Run training:

accelerate launch -m axolotl.cli.train config.yml

For multi-GPU: accelerate launch --num_processes 2 -m axolotl.cli.train config.yml after accelerate config.


Hyperparameter Tuning {#hyperparameters}

| Hyperparameter | Default | Range | Effect |
|---|---|---|---|
| Learning rate | 2e-4 | 1e-4 to 5e-4 | Too high = NaN; too low = no learning |
| Rank (r) | 32 | 8 to 128 | Higher = more capacity, more memory |
| Alpha | 2×r | r to 4×r | Scaling factor |
| Dropout | 0 | 0 to 0.1 | Regularization for small datasets |
| Epochs | 3 | 1 to 5 | Too many = overfit on small data |
| Batch size | 1-4 | depends on VRAM | Higher = more stable gradients |
| Grad accumulation | 4-16 | depends | Effective batch = micro × accum |

For most domain adaptations: rank 32, alpha 64, LR 2e-4, 3 epochs is the right starting point.
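To see what a rank choice costs in trainable parameters, here is a sketch using Llama 3.1 8B dimensions (hidden 4096, GQA k/v dim 1024, FFN 14336, 32 layers — taken from the public model config; the helper itself is illustrative):

```python
def lora_params(r: int, n_layers: int = 32, d: int = 4096,
                d_kv: int = 1024, d_ffn: int = 14336) -> int:
    """Trainable LoRA parameters when targeting all linear projections:
    each adapted d_in -> d_out layer adds r * (d_in + d_out) parameters."""
    per_layer = (
        r * (d + d) * 2        # q_proj, o_proj
        + r * (d + d_kv) * 2   # k_proj, v_proj
        + r * (d + d_ffn) * 2  # gate_proj, up_proj
        + r * (d_ffn + d)      # down_proj
    )
    return per_layer * n_layers

print(lora_params(32))   # 83886080  -> ~84M trainable at rank 32
print(lora_params(128))  # 4x that at rank 128: more capacity, more VRAM
```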


Training Loop Monitoring {#monitoring}

Track:

  • Train loss: should decrease smoothly. Sudden spike = LR too high or bad batch.
  • Eval loss on held-out set: U-shape = overfit; monotonically decreasing = under-trained.
  • Tokens/sec: should be stable; drops = data pipeline issue.
  • GPU util: should be 80-95% during training; lower = bottleneck.

Use Weights & Biases or TensorBoard for dashboards; enabling wandb is a single flag in both Unsloth and Axolotl.
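The spike check is easy to automate over logged losses. A toy sketch (the window, threshold, and helper name are arbitrary choices, not part of wandb or TensorBoard):

```python
def loss_spikes(losses: list[float], window: int = 5, factor: float = 2.0) -> list[int]:
    """Indices where loss jumps above `factor` x the trailing-window mean."""
    spikes = []
    for i in range(window, len(losses)):
        trailing = sum(losses[i - window:i]) / window
        if losses[i] > factor * trailing:
            spikes.append(i)
    return spikes

# Step 6 looks like a bad batch or an LR that is too high:
log = [2.1, 1.9, 1.8, 1.7, 1.6, 1.6, 9.4, 1.5]
print(loss_spikes(log))  # [6]
```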


Merging LoRA Adapter {#merging}

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base, "./my-lora-adapter")
merged = model.merge_and_unload()
merged.save_pretrained("./my-merged-model")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.save_pretrained("./my-merged-model")

The merged model is a standard Llama checkpoint runnable via Ollama / vLLM / llama.cpp without LoRA-specific support.


Multi-LoRA Serving (vLLM) {#multi-lora}

For serving many fine-tunes from one base:

vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --enable-lora \
    --max-loras 8 --max-lora-rank 64 \
    --lora-modules legal=./loras/legal medical=./loras/medical

Per-request adapter selection: pass the adapter's registered name as the model field of the request, e.g. "model": "legal".

See vLLM multi-LoRA section.
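An illustrative request body for the server above — one common pattern is to address the adapter by the name registered via --lora-modules (the prompt text here is made up):

```python
import json

# OpenAI-style chat request routed to the "legal" adapter.
payload = {
    "model": "legal",
    "messages": [{"role": "user", "content": "Summarize this contract clause."}],
    "max_tokens": 256,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions
print(json.loads(body)["model"])  # legal
```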


Multi-GPU QLoRA {#multi-gpu}

For 2-4 GPU setups:

# Axolotl with FSDP
accelerate config  # set num_processes, mixed_precision bf16
accelerate launch -m axolotl.cli.train config.yml

For 70B QLoRA on 2x RTX 3090 with NVLink: ~12-18 hours for 1K examples (vs 24-36 on single 4090). NVLink helps significantly for multi-GPU LoRA.


Vision-Language QLoRA {#vlm}

For Qwen 2-VL 7B:

# LLaMA-Factory config
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
finetuning_type: lora
lora_target: all
freeze_vision_tower: true

Standard pattern: freeze the vision encoder and train LoRA on the language model. 1-2K labeled image+text pairs typically yield a substantial accuracy improvement on domain tasks.


Variants: LoRA-FA, GaLore, DoRA {#variants}

  • LoRA-FA: freezes the A matrix (random init), trains only B. Halves trainable parameters at minor quality cost.
  • GaLore (Gradient Low-Rank Projection): projects gradients into a low-rank subspace before applying. Approximates full fine-tuning at LoRA-like memory.
  • DoRA (Weight-Decomposed LoRA): decomposes pretrained weights into magnitude × direction; trains direction with LoRA. ~1% better quality than QLoRA at same rank.

Unsloth and Axolotl support all four. The default in 2026 remains QLoRA; DoRA is worth experimenting with for quality-sensitive tasks.
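The decompositions behind LoRA and DoRA are easy to see on toy matrices. A numpy sketch (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # LoRA down-projection
B = np.zeros((d, r))                 # LoRA up-projection (zero init)

# LoRA: effective weight is W + B @ A; with B = 0 at init, output is unchanged.
delta = B @ A
assert np.allclose(W + delta, W)

# Parameter savings: a full update is d*d values, the low-rank pair is 2*d*r.
print(d * d, 2 * d * r)              # 4096 512

# DoRA: split columns into magnitude * direction; the direction gets the LoRA update.
m = np.linalg.norm(W, axis=0)        # per-column magnitude (trained directly)
V = W / m                            # unit-norm column directions
assert np.allclose(V * m, W)
```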


Common Failures {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| OOM at start | Sequence length too high | Lower sequence_len or use sample packing |
| OOM mid-training | Memory leak / gradient checkpointing disabled | Enable gradient checkpointing |
| NaN loss | LR too high | Drop to 1e-4 |
| Loss plateaus | LR too low or rank too low | Increase one |
| Eval loss diverges from train | Overfitting on small dataset | Add dropout 0.05, fewer epochs |
| Garbage outputs after merge | Wrong target modules | Verify target_modules matches model architecture |
| Multi-GPU slower than single | NCCL config | Tune NCCL_P2P_LEVEL |

FAQ {#faq}

See answers to common QLoRA questions below.


Sources: QLoRA paper (Dettmers et al., 2023) | Unsloth GitHub | Axolotl GitHub | HuggingFace TRL | Internal benchmarks RTX 4090, 2x 3090.


Published: May 1, 2026 · Last Updated: May 1, 2026 · Manually Reviewed


Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
