Train an Image LoRA Locally (2026): Kohya, SDXL & FLUX
To train an image LoRA locally in 2026, use Kohya's sd-scripts (the standard trainer). An SDXL LoRA needs about 12 GB of VRAM minimum (16 GB comfortable). A FLUX.1 LoRA traditionally wanted 24 GB but now runs on 16 GB, and on as little as 4-8 GB with heavy optimization, thanks to Kohya's fused-backward-pass optimizations added in v0.9.0 (January 2025). You only need 10-30 captioned images, a sensible learning rate (around 1e-4), and a LoRA rank of roughly 16-32 for FLUX or 32-64 for SDXL. On a 24 GB card, plan on roughly 30-60 minutes of training for SDXL or 1-3 hours for FLUX.1, then publish the resulting .safetensors file to CivitAI.
This guide is specifically about image LoRAs — teaching a diffusion model a new character, art style, or product. That is a completely different toolchain from training a text LLM LoRA. If you came here looking to fine-tune a language model with Unsloth/PEFT instead, read our separate LLM LoRA fine-tuning guide; everything below targets SDXL and FLUX.1 diffusion models.
What is the standard tool for training an image LoRA locally?
The de facto standard is Kohya's sd-scripts (kohya-ss/sd-scripts on GitHub) — a Python toolkit of training scripts for Stable Diffusion and related image models. It ships sdxl_train_network.py for SDXL LoRAs and flux_train_network.py for FLUX.1 LoRAs, plus support for SD 1.5/2.x, SD3/3.5 and other models. Almost every other local image-LoRA tool wraps these scripts.
Your practical options, all built on the same engine:
| Tool | What it is | Best for | Min VRAM (LoRA) |
|---|---|---|---|
| kohya-ss/sd-scripts | The raw upstream training scripts (CLI/TOML config) | Maximum control, repeatable configs | SDXL ~8-12 GB / FLUX ~4-16 GB |
| bmaltais/kohya_ss | A Gradio GUI wrapping sd-scripts | Beginners who want a UI over the same scripts | Same as above |
| FluxGym | A "dead simple" low-VRAM web UI for FLUX, wrapping Kohya | FLUX on 12/16/20 GB cards | FLUX ~12 GB |
| ComfyUI-FluxTrainer (kijai) | A ComfyUI node set wrapping modified Kohya FLUX scripts | People already living in ComfyUI | FLUX ~12 GB (split_mode) |
The takeaway: pick the interface you like, but the actual math is Kohya's underneath. You can verify the script list and supported models on the official kohya-ss/sd-scripts repository. If you have never installed a local diffusion stack at all, start with our FLUX local image generation guide first so you already have the base models and Python environment in place.
SDXL LoRA vs FLUX LoRA: which should you train?
This is the first real decision, because it sets your VRAM floor and your settings.
- SDXL (Stable Diffusion XL) is a U-Net latent diffusion model with roughly 3.5B parameters in total (about 2.6B of them in the U-Net itself). It is the easy, forgiving option: lighter on VRAM, faster to train, very tolerant of imperfect settings, and backed by a massive ecosystem of existing LoRAs and checkpoints.
- FLUX.1 is a 12-billion-parameter rectified-flow transformer (a DiT — Diffusion Transformer) from Black Forest Labs. It produces noticeably better prompt adherence and anatomy, but it is heavier and — importantly — more sensitive to learning rate than SDXL. Push the LR too high on FLUX and outputs over-saturate or distort; SDXL shrugs off the same mistake.
| Factor | SDXL LoRA | FLUX.1 LoRA |
|---|---|---|
| Base architecture | U-Net latent diffusion (~3.5B total, ~2.6B in the U-Net) | 12B rectified-flow DiT |
| Min VRAM (LoRA) | ~12 GB (8-10 GB with optimizations) | ~16 GB (4-8 GB heavily optimized) |
| Recommended VRAM | 16 GB | 24 GB |
| Typical LoRA rank (dim) | 32-64 | 16-32 |
| Typical learning rate | ~1e-4 | ~1e-4, with less tolerance for error |
| LR sensitivity | Forgiving | Touchy — sweep carefully |
| Training resolution | 1024×1024 | 1024×1024 |
| Image quality ceiling | High | Higher (better hands/text) |
Rule of thumb: if you have a 12 GB card, train SDXL. If you have 16 GB, you can do either (FLUX via FluxGym or ComfyUI-FluxTrainer). If you have 24 GB, train FLUX comfortably and stop worrying about optimization flags.
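The rule of thumb above can be written down as a small lookup. This is only a sketch of the article's own guidance: the function name `recommend_base_model` is hypothetical, and the thresholds are the tiers quoted in this section, not measured limits.

```python
def recommend_base_model(vram_gb: float) -> str:
    """Map GPU VRAM (in GB) to this guide's rule-of-thumb recommendation."""
    if vram_gb >= 24:
        return "FLUX.1 (comfortable, no optimization flags to worry about)"
    if vram_gb >= 16:
        return "SDXL or FLUX.1 (FLUX via FluxGym or ComfyUI-FluxTrainer)"
    if vram_gb >= 12:
        return "SDXL"
    # Below 12 GB training is possible but slow and needs heavy optimization.
    return "SDXL with heavy optimization (expect slow runs)"


if __name__ == "__main__":
    for vram in (8, 12, 16, 24):
        print(f"{vram} GB -> {recommend_base_model(vram)}")
```

Treat the output as a starting point; resolution, batch size, and block swapping all move these boundaries.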
What about FLUX.2? Black Forest Labs released the FLUX.2 series on 25 November 2025 (Pro, Flex, Dev, and an Apache-2.0 "Klein" variant). The open-weight FLUX.2 dev model is larger and heavier than FLUX.1 and realistically wants 24 GB+ to train, while the smaller Klein is aimed at consumer cards. It's newer, and the consumer-GPU training tooling is still settling. This guide deliberately stays on the proven, lighter FLUX.1 and SDXL path, which is what most people training image LoRAs on their own hardware still use — the workflow below carries over to FLUX.2 once your VRAM and tooling are ready for the bigger model.
How much VRAM do you actually need to train an image LoRA?
Here is the honest, by-tier breakdown for LoRA training specifically (full fine-tuning needs far more). The low-end FLUX numbers are real but come with tradeoffs — block swapping to system RAM, smaller batch sizes, and longer training times.
| GPU VRAM | SDXL LoRA | FLUX.1 LoRA | Notes |
|---|---|---|---|
| 8 GB | ⚠️ Possible | ⚠️ Possible (heavy optimization) | Slow; gradient checkpointing + 8-bit Adam + block swap mandatory |
| 12 GB | ✅ Practical minimum | ✅ Via FluxGym / ComfyUI split_mode | The practical SDXL sweet spot (e.g. RTX 3060 12GB) |
| 16 GB | ✅ Easy, batch 2-4 | ✅ Good | RTX 4060 Ti 16GB, 16GB Mac viable |
| 24 GB | ✅ Headroom to spare | ✅ Recommended FLUX tier | RTX 3090 / 4090 — the no-compromise choice |
| 48 GB+ | ✅ | ✅ Full fine-tunes too | A6000/A100 territory |
The big VRAM unlock was Kohya's fused backward pass in sd-scripts v0.9.0 (January 2025), which is what dropped FLUX LoRA training from a hard 24 GB requirement down toward 4-8 GB on optimized configs. Note that "it fits in 8 GB" and "it trains well in 8 GB" are different statements — expect long runs and small batches at the bottom of this table. For a deeper look at what a 24 GB card buys you across image work, see our RTX 3090 local AI guide.
How do you prepare a dataset for an image LoRA?
The dataset is where most LoRAs are won or lost. Quality and consistency beat quantity every time — 20 sharp, varied images outperform 75 inconsistent ones.
1. Collect images (10-30 is plenty). A character LoRA can train on as few as 10 good images; a style LoRA usually wants 15-30 to capture range. Vary the pose, lighting, and background so the model learns the subject, not the backdrop.
2. Resize to the training resolution. For both SDXL and FLUX, 1024×1024 is the standard. Kohya's bucketing can handle mixed aspect ratios, but consistent square crops are the simplest path.
3. Caption every image. Each image needs a matching .txt caption file. For a character LoRA, use a unique trigger word plus minimal description (e.g. "ohwx woman, smiling, outdoor"). For a style LoRA, describe the content and let the style be learned implicitly, or use a style trigger token. Auto-captioners (BLIP, WD14 tagger) get you 80% there; hand-edit the rest.
4. Organize folders with a repeat count. Kohya reads folder names like 10_ohwx woman, where 10 is the number of repeats per image per epoch. Total steps = image count × repeats × epochs ÷ batch size; for example, 20 images × 10 repeats × 8 epochs at batch size 1 comes to 1,600 steps.
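The dataset steps above can be sanity-checked with a few lines of Python before you start a run. This is a minimal sketch, not part of sd-scripts: the helper names (`parse_repeat_folder`, `missing_captions`, `total_steps`) and the list of image extensions are assumptions made for illustration. The step count also ignores gradient accumulation and any per-bucket rounding the trainer may apply.

```python
import math
from pathlib import Path

# Assumed set of image types; adjust to whatever your dataset contains.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def parse_repeat_folder(folder_name: str) -> tuple[int, str]:
    """Split a folder name like '10_ohwx woman' into (repeats, label)."""
    repeats, _, label = folder_name.partition("_")
    if not repeats.isdigit() or not label:
        raise ValueError(f"expected '<repeats>_<name>', got {folder_name!r}")
    return int(repeats), label


def missing_captions(folder: Path) -> list[str]:
    """Return names of images in `folder` that lack a matching .txt caption."""
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.suffix.lower() in IMAGE_EXTENSIONS
        and not p.with_suffix(".txt").exists()
    )


def total_steps(num_images: int, repeats: int, epochs: int, batch_size: int = 1) -> int:
    """Each epoch sees num_images * repeats samples, grouped into batches."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


if __name__ == "__main__":
    repeats, label = parse_repeat_folder("10_ohwx woman")
    print(repeats, label)
    print(total_steps(num_images=20, repeats=repeats, epochs=8))
    print(total_steps(num_images=20, repeats=repeats, epochs=8, batch_size=2))
```

Running `missing_captions(Path("/dataset/10_ohwx woman"))` on your own folder (the path here is a placeholder) should return an empty list before you train.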
What learning rate, rank, and step count should you use?
These are sane starting points, not gospel — every dataset is different. Treat the FLUX learning rate as something to sweep, because, as noted, FLUX is far less forgiving than SDXL here.
| Setting | SDXL LoRA | FLUX.1 LoRA |
|---|---|---|
| Network rank (dim) | 32 (up to 64) | 16-32 |
| Network alpha | 16 (half of dim is common) | 16 |
| Learning rate | 1e-4 | ~1e-4, sweep 5e-5 → 2e-4 |
| Optimizer | AdamW8bit | AdamW8bit |
| Resolution | 1024 | 1024 |
| Total steps | ~1,000-2,000 | ~1,000-2,500 |
| Gradient checkpointing | On (saves VRAM) | On (saves VRAM) |
A few notes from running these. Higher rank means a larger LoRA file and more capacity to memorize, but also more risk of overfitting on a small dataset; for a single character, rank 16-32 is usually enough. AdamW8bit keeps its optimizer state in 8-bit rather than 32-bit precision, which cuts VRAM use meaningfully versus full-precision AdamW with negligible quality loss. That is why it appears in nearly every low-VRAM recipe.
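To make the rank, alpha, and learning-rate settings concrete, here is a small sketch of the arithmetic behind them. It assumes the common LoRA formulation in which the learned update is scaled by alpha / dim and each adapted linear layer adds rank × (d_in + d_out) parameters. The function names and the 1024-wide layer are hypothetical, chosen only for illustration.

```python
def lora_scale(alpha: float, dim: int) -> float:
    """Multiplier applied to the LoRA update in the usual alpha/dim convention."""
    return alpha / dim


def lora_params_per_layer(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added to one linear layer: down (rank x d_in) plus up (d_out x rank)."""
    return rank * (d_in + d_out)


def lr_sweep(low: float, high: float, points: int) -> list[float]:
    """Geometrically spaced learning rates from `low` to `high`, inclusive."""
    if points < 2:
        raise ValueError("need at least two points for a sweep")
    ratio = (high / low) ** (1 / (points - 1))
    return [low * ratio**i for i in range(points)]


if __name__ == "__main__":
    print(lora_scale(alpha=16, dim=32))            # alpha set to half of dim
    print(lora_params_per_layer(1024, 1024, 16))   # hypothetical 1024-wide layer
    print(lora_params_per_layer(1024, 1024, 32))   # doubling rank doubles size
    print(lr_sweep(5e-5, 2e-4, 3))
```

A three-point sweep between 5e-5 and 2e-4 lands on roughly 5e-5, 1e-4, and 2e-4, which matches the FLUX range in the table. Because the parameter count is linear in rank, a rank-32 LoRA is about twice the file size of a rank-16 one.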
How do you train SDXL with Kohya sd-scripts (CLI)?
Once your dataset folder and a TOML config are ready, an SDXL LoRA run looks roughly like this. (Paths and the exact flag set depend on your config; this is the shape of the command, not a copy-paste-forever recipe.)
accelerate launch sdxl_train_network.py \
--pretrained_model_name_or_path="/models/sd_xl_base_1.0.safetensors" \
--dataset_config="/dataset/config.toml" \
--output_dir="/output" --output_name="my_sdxl_lora" \
--network_module=networks.lora \
--network_dim=32 --network_alpha=16 \
--learning_rate=1e-4 --optimizer_type=AdamW8bit \
--max_train_steps=1500 --mixed_precision=bf16 \
--gradient_checkpointing --save_model_as=safetensors
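The command above points at a dataset config file. As a hedged illustration of what such a file can contain, the sketch below generates a minimal one from Python. The key names (`caption_extension`, `enable_bucket`, `resolution`, `batch_size`, `image_dir`, `num_repeats`) follow the sd-scripts dataset-config documentation as I understand it, so check them against the repository for your version. The helper `dataset_config_toml` and the example path are hypothetical.

```python
def dataset_config_toml(
    image_dir: str,
    num_repeats: int = 10,
    resolution: int = 1024,
    batch_size: int = 1,
) -> str:
    """Render a minimal single-subset dataset config as TOML text."""
    # Single quotes make image_dir a TOML literal string, so Windows
    # backslashes in the path are not treated as escape sequences.
    return (
        "[general]\n"
        'caption_extension = ".txt"\n'
        "enable_bucket = true\n"
        "\n"
        "[[datasets]]\n"
        f"resolution = {resolution}\n"
        f"batch_size = {batch_size}\n"
        "\n"
        "  [[datasets.subsets]]\n"
        f"  image_dir = '{image_dir}'\n"
        f"  num_repeats = {num_repeats}\n"
    )


if __name__ == "__main__":
    # Placeholder path; write the result to the file passed as --dataset_config.
    print(dataset_config_toml("/dataset/ohwx_woman", num_repeats=10))
```

When repeats are set through `num_repeats` in the config like this, the repeat count does not need to be encoded in the folder name.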
The FLUX equivalent uses flux_train_network.py and additionally points at the FLUX text encoders (CLIP-L and T5-XXL) and the FLUX VAE/autoencoder. Because of FLUX's size, low-VRAM runs add block-swap flags to offload transformer blocks to system RAM — which is exactly the kind of plumbing FluxGym and ComfyUI-FluxTrainer hide for you.
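As a rough illustration of how the FLUX command differs, the sketch below assembles an argument list in Python. The FLUX-specific flag names (`--clip_l`, `--t5xxl`, `--ae`, `--blocks_to_swap`) and the `networks.lora_flux` module are written from my reading of the sd-scripts FLUX documentation and should be treated as assumptions: confirm them with `flux_train_network.py --help` for your installed version. All paths are placeholders, and a real run usually needs further FLUX-specific sampling options that are left out here.

```python
import shlex


def flux_lora_command(
    model: str,
    clip_l: str,
    t5xxl: str,
    ae: str,
    dataset_config: str,
    output_dir: str,
    output_name: str,
    blocks_to_swap: int = 0,
) -> list[str]:
    """Build an argument list for a FLUX.1 LoRA run (flag names are assumptions)."""
    args = [
        "accelerate", "launch", "flux_train_network.py",
        f"--pretrained_model_name_or_path={model}",
        f"--clip_l={clip_l}",      # CLIP-L text encoder
        f"--t5xxl={t5xxl}",        # T5-XXL text encoder
        f"--ae={ae}",              # FLUX autoencoder (VAE)
        f"--dataset_config={dataset_config}",
        f"--output_dir={output_dir}",
        f"--output_name={output_name}",
        "--network_module=networks.lora_flux",
        "--network_dim=16", "--network_alpha=16",
        "--learning_rate=1e-4", "--optimizer_type=AdamW8bit",
        "--max_train_steps=1500", "--mixed_precision=bf16",
        "--gradient_checkpointing", "--save_model_as=safetensors",
    ]
    if blocks_to_swap > 0:
        # Offload this many transformer blocks to system RAM on low-VRAM cards.
        args.append(f"--blocks_to_swap={blocks_to_swap}")
    return args


if __name__ == "__main__":
    cmd = flux_lora_command(
        model="/models/flux1-dev.safetensors",
        clip_l="/models/clip_l.safetensors",
        t5xxl="/models/t5xxl_fp16.safetensors",
        ae="/models/ae.safetensors",
        dataset_config="/dataset/config.toml",
        output_dir="/output",
        output_name="my_flux_lora",
        blocks_to_swap=18,
    )
    print(shlex.join(cmd))
```

Building the command as a list keeps paths with spaces safe and makes it easy to diff two configurations.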
What is the ComfyUI Flux Trainer alternative?
If you already run image generation in ComfyUI, you do not need to touch a terminal. ComfyUI-FluxTrainer (by kijai) is a set of custom nodes that wrap slightly modified Kohya FLUX training scripts, letting you build a training workflow as a node graph — and use the same base models you already generate with. It can run on around 12 GB of VRAM with split_mode enabled (it then leans on roughly 32 GB of system RAM), and you install it straight from the ComfyUI Manager.
The appeal is comparison: because settings are nodes, you can branch a graph to test two ranks or learning rates side by side. The tradeoff is that node-based training is fiddlier to reproduce than a saved TOML. New to ComfyUI itself? Our complete ComfyUI guide covers installation and the node basics you'll need before adding the trainer. You can also read the node documentation on the ComfyUI-FluxTrainer repository.
Recipes: character, style, and product LoRAs
The same engine, tuned differently for the three most common jobs:
- Character LoRA. 10-20 images of one subject across varied poses/lighting, a unique trigger token, rank 16-32, ~1,500 steps. Keep backgrounds varied so the LoRA learns the face/body, not a setting. This is the most forgiving recipe and a great first project.
- Style LoRA. 15-30 images that share an art style but differ in subject. Caption the content of each image and let the style be the constant the model absorbs. Slightly higher rank (32) helps capture broad stylistic range; watch for the LoRA "leaking" specific subjects from your dataset (a sign of too-high rank or too-few images).
- Product LoRA. 15-25 clean shots of the product on neutral and varied backgrounds, multiple angles. Consistency of the object matters most — crop tightly, keep lighting honest, and avoid heavy filters that the model would memorize as part of the product.
How long does training take, and where do you publish it?
Approximate, single-machine figures. On an RTX 3090 (24 GB), I'd ballpark an SDXL LoRA at roughly 30-60 minutes for ~1,500 steps at 1024px, and a FLUX.1 LoRA noticeably longer — on the order of 1-3 hours for a comparable run — because of FLUX's 12B size. These are rough, hardware- and config-dependent numbers, not a benchmark; batch size, resolution, and whether you're block-swapping to system RAM swing them a lot. The moment FLUX has to offload blocks to RAM on a smaller card, training time climbs steeply.
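To turn those ballpark figures into a number you can compare with your own runs, a quick conversion to seconds per step helps. This is plain arithmetic on the estimates above, not a benchmark, and the function names are hypothetical.

```python
def seconds_per_step(total_minutes: float, steps: int) -> float:
    """Average wall-clock seconds per optimizer step."""
    return total_minutes * 60 / steps


def estimated_minutes(steps: int, sec_per_step: float) -> float:
    """Projected run length in minutes for a given step count and speed."""
    return steps * sec_per_step / 60


if __name__ == "__main__":
    # The SDXL ballpark above: ~1,500 steps in 30-60 minutes.
    fast = seconds_per_step(30, 1500)
    slow = seconds_per_step(60, 1500)
    print(f"SDXL ballpark: {fast:.1f}-{slow:.1f} s/step")
    # Projecting a longer 2,000-step run at the slow end of that range.
    print(f"2,000 steps: about {estimated_minutes(2000, slow):.0f} minutes")
```

Time the first hundred steps of your own run and feed that speed into `estimated_minutes` for a more realistic projection than any published figure.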
When training finishes you get a single .safetensors file. Drop it in your models/Lora folder to use locally, and to share it, the standard hub is CivitAI — the largest community catalog of image models, hosting the overwhelming majority of public Stable Diffusion and FLUX LoRAs (tens of thousands of LoRA resources and counting). On CivitAI you upload the file as a "LoRA" resource, set a category and tags, write a description with example prompts, and choose your merge/commercial permissions before publishing.
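Before uploading, it can be worth checking what the file actually contains. A .safetensors file starts with an 8-byte little-endian header length followed by a JSON header, so the standard library is enough to read it. The sketch below is a minimal reader under that assumption. The helper names are hypothetical, and which metadata keys appear (Kohya-trained files typically carry `ss_`-prefixed entries such as the network dim) depends on the trainer and version.

```python
import json
import struct
import sys


def read_safetensors_header(path: str) -> dict:
    """Read the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize_lora(path: str) -> dict:
    """Return the tensor count and the free-form metadata block, if any."""
    header = read_safetensors_header(path)
    metadata = header.pop("__metadata__", {})
    return {"tensor_count": len(header), "metadata": metadata}


if __name__ == "__main__":
    if len(sys.argv) > 1:
        summary = summarize_lora(sys.argv[1])
        print("tensors:", summary["tensor_count"])
        for key, value in sorted(summary["metadata"].items()):
            print(f"{key}: {value}")
```

Run it with the path to your trained file as the only argument. The metadata can include training settings, so review it before sharing the file publicly.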
Key Takeaways
- Kohya's sd-scripts is the standard image-LoRA trainer. FluxGym, the bmaltais GUI, and ComfyUI-FluxTrainer all wrap it — pick the interface, the engine is the same.
- SDXL needs ~12 GB minimum (16 GB comfortable); FLUX wanted 24 GB but now runs on 16 GB and even 4-8 GB thanks to Kohya's fused backward pass (v0.9.0, Jan 2025) — with longer runs at the bottom.
- FLUX is more learning-rate sensitive than SDXL. Both sit near 1e-4, but sweep FLUX (5e-5 → 2e-4) and use a lower rank (16-32 vs SDXL's 32-64).
- 10-30 captioned images is enough. Quality and variety beat quantity; a character LoRA trains on as few as 10 good images.
- Train, then publish the .safetensors to CivitAI — the dominant community hub for image LoRAs.
Next Steps
- Don't have the base models yet? Set up generation first with our FLUX local image generation guide.
- Prefer a visual workflow over the CLI? Read the complete ComfyUI guide, then add ComfyUI-FluxTrainer.
- Training a text model instead of an image one? That's a different stack — see the LLM LoRA fine-tuning guide.
- Weighing the 24 GB upgrade for faster FLUX runs? Our RTX 3090 local AI guide covers what that VRAM unlocks.