SD Forge Guide (2026): Faster Automatic1111 with Native Flux and SD 3.5 Support
SD Forge is what Automatic1111 looks like when someone rewrites the backend for memory and speed. lllyasviel, author of ControlNet, Fooocus, and several Stable Diffusion architecture papers, built it as a faster A1111 with native Flux Dev support and dynamic VRAM offloading. The UI is identical and the extensions mostly carry over, but generation is 30-60% faster on the same hardware, and a 6 GB GPU can run Flux NF4 where A1111 would OOM.
This guide covers everything: installation across platforms, the differences from A1111, the reForge community fork, native Flux Dev / Schnell / SD 3.5 support, NF4 / FP8 / GGUF quantization for low-VRAM users, ControlNet integration, extension compatibility, and migration from an existing A1111 setup.
Table of Contents
- What Forge Is and Why Use It
- Forge vs reForge
- Forge vs A1111 vs ComfyUI
- Hardware Requirements
- Installation: Windows, Linux, Mac
- Migrating from A1111
- The UNet Patcher and Dynamic Offloading
- Flux Dev / Schnell Setup
- SD 3.5 Setup
- SDXL and SD 1.5 in Forge
- LoRA / ControlNet / Embeddings
- NF4 Quantization for 6-8 GB VRAM
- GGUF Quantization for Flux
- Extensions Compatibility
- API Mode
- Tuning Recipes
- AMD and Mac Setup
- Real Benchmarks
- Troubleshooting
- FAQ
What Forge Is and Why Use It {#what-it-is}
Stable Diffusion Forge ("Forge" for short) is a fork of Automatic1111 (A1111) maintained by lllyasviel since early 2024. The user interface is identical to A1111 — same tabs, same scripts, same extension API surface. Under the hood, the model loader, sampler scheduler, and memory manager are rewritten.
Headline benefits:
- 30-60% faster generation on the same hardware vs A1111
- Lower VRAM thanks to the UNet patcher (dynamic offload)
- Native Flux Dev / Schnell / SD 3.5 / SD 3 support
- Built-in FlashAttention without xformers dependency hell
- NF4 / FP8 / GGUF Flux for tight VRAM budgets
Project: github.com/lllyasviel/stable-diffusion-webui-forge. License: AGPL-3.0.
Forge vs reForge {#forge-vs-reforge}
| Variant | Maintainer | Status (mid-2026) |
|---|---|---|
| Original Forge | lllyasviel | Active, conservative pace |
| reForge | Panchovix | Active, faster A1111 feature merges |
| sd-webui-forge-classic | community | Frozen but stable |
Most users in 2026 should install reForge: it combines the original Forge's performance work with up-to-date A1111 features, and performance between the two is essentially identical.
Forge vs A1111 vs ComfyUI {#comparison}
| Property | A1111 | Forge / reForge | ComfyUI |
|---|---|---|---|
| UI | Tabbed | Tabbed (same as A1111) | Node graph |
| SDXL speed (RTX 4090) | 4 sec/image | ~3 sec/image | ~3 sec/image |
| Flux Dev speed | Limited support | Native, fast | Native, fast |
| Min VRAM for Flux Dev | 24 GB (FP8) | 6 GB (NF4) | 8 GB (NF4 / GGUF) |
| Extension compatibility | All A1111 | Most A1111 | ComfyUI ecosystem |
| Workflow flexibility | Medium | Medium | Highest |
| Learning curve | Easy | Easy (same as A1111) | Steep |
For users transitioning from A1111 to Flux: Forge is the no-brainer choice. For brand-new users wanting maximum flexibility: ComfyUI.
Hardware Requirements {#requirements}
| GPU VRAM | Models you can run |
|---|---|
| 4 GB | SD 1.5 only |
| 6 GB | SD 1.5, SDXL Lightning, Flux NF4 |
| 8 GB | SDXL, SD 3.5 Medium, Flux GGUF Q4 |
| 12 GB | SDXL with refiner, SD 3.5 Large, Flux FP8 |
| 16 GB | All SDXL, Flux Dev FP8 |
| 24 GB | Flux Dev BF16, large LoRA stacks |
Forge's memory advantage over A1111 comes from the UNet patcher; see the UNet Patcher and Dynamic Offloading section below.
Installation: Windows, Linux, Mac {#installation}
Windows
# Install Python 3.10.6, git
git clone https://github.com/Panchovix/stable-diffusion-webui-reForge
cd stable-diffusion-webui-reForge
.\webui-user.bat
First run takes 10-20 minutes (PyTorch + dependencies).
Linux
git clone https://github.com/Panchovix/stable-diffusion-webui-reForge
cd stable-diffusion-webui-reForge
./webui.sh
Mac (Apple Silicon)
git clone https://github.com/Panchovix/stable-diffusion-webui-reForge
cd stable-diffusion-webui-reForge
./webui.sh
MPS auto-detected. Performance lags NVIDIA — Draw Things (MLX-native) is faster on Mac.
Migrating from A1111 {#migration}
Keep your existing A1111 install. Install Forge to a separate directory. Edit Forge's webui-user.bat / .sh to add:
COMMANDLINE_ARGS=--ckpt-dir /path/to/A1111/models/Stable-diffusion \
--lora-dir /path/to/A1111/models/Lora \
--vae-dir /path/to/A1111/models/VAE \
--controlnet-dir /path/to/A1111/extensions/sd-webui-controlnet/models \
--embeddings-dir /path/to/A1111/embeddings
All your models / LoRAs / VAEs / ControlNet / embeddings appear in Forge unchanged. No conversion needed.
For extensions, install the same ones via Forge's Extensions tab → most "just work."
The UNet Patcher and Dynamic Offloading {#unet-patcher}
The UNet patcher tracks VRAM usage in real time and offloads layers to system RAM only when needed. Compared to A1111's --medvram (offload everything always) or --lowvram (offload aggressively), Forge's approach is dynamic and per-layer.
Result on a 12 GB GPU running Flux Dev FP8:
- A1111 with --lowvram: ~25 sec / image, full offload
- Forge default: ~12 sec / image, partial offload as needed
You do NOT need to set memory flags in Forge — it figures out the right behavior automatically.
For very tight VRAM you can still hint:
COMMANDLINE_ARGS=--always-low-vram # force low-VRAM mode
COMMANDLINE_ARGS=--cuda-malloc # use CUDA's async allocator (cudaMallocAsync)
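The per-layer placement idea can be illustrated with a toy sketch. This is not Forge's actual code, just the greedy budget logic behind dynamic offload, as opposed to --medvram's "offload everything always":

```python
# Toy sketch of dynamic per-layer offloading (NOT Forge's real implementation):
# keep layers resident on the GPU until a VRAM budget is exhausted, spill the
# rest to system RAM, and stream those in only when their turn comes.
def place_layers(layer_sizes_mb, vram_budget_mb):
    """Assign each layer to 'gpu' until the budget runs out, then 'cpu'."""
    placement, used = [], 0
    for size in layer_sizes_mb:
        if used + size <= vram_budget_mb:
            placement.append("gpu")
            used += size
        else:
            placement.append("cpu")  # offloaded to system RAM
    return placement

# A 12 GB card with ~10 GB free keeps most of a large UNet resident
# (hypothetical layer sizes for illustration):
print(place_layers([1200] * 10, 10_000))
```

Because most layers stay resident, only a fraction of steps pay the PCIe transfer cost, which is why Forge's partial offload beats A1111's all-or-nothing flags on the same hardware.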
Flux Dev / Schnell Setup {#flux}
Files needed
models/Stable-diffusion/flux1-dev.safetensors # 24 GB BF16
or flux1-dev-fp8.safetensors # 12 GB FP8
or flux1-dev-Q8_0.gguf # 13 GB GGUF
or flux1-dev-bnb-nf4-v2.safetensors # 6.5 GB NF4
models/Stable-diffusion/flux1-schnell.safetensors # 24 GB BF16
models/VAE/ae.safetensors # Flux VAE
models/text_encoder/clip_l.safetensors # 246 MB
models/text_encoder/t5xxl_fp16.safetensors # 9.8 GB (or fp8 at 5 GB)
For Flux specifically, Forge has a "Diffusion in Low Bits" dropdown in the UI that lets you pick automatic offloading strategies (NF4, FP8, BF16, FP16). Select it before generating.
Recommended Flux Dev settings:
- Sampler: Euler
- Schedule: Simple
- Steps: 20
- CFG: 1.0
- Resolution: 1024x1024 (Flux is trained for 0.5-2 megapixels)
For Flux Schnell: 4 steps, CFG 1.0. ~5x faster than Dev.
SD 3.5 Setup {#sd-35}
Stability AI's SD 3.5 family works in Forge natively:
models/Stable-diffusion/sd3.5_large.safetensors # ~16 GB FP16
or sd3.5_medium.safetensors # ~5 GB FP16
You also need the SD 3 text encoders (CLIP-G, CLIP-L, T5-XXL). Forge prompts you on first run.
Recommended settings:
- Sampler: Euler
- Schedule: Simple
- Steps: 28-40
- CFG: 4.5
SD 3.5 Large requires ~16 GB VRAM at FP16; the Medium model fits in ~10 GB.
SDXL and SD 1.5 in Forge {#sdxl}
Identical workflow to A1111 — same UI, same parameters, same extensions. Speed advantage on the same hardware:
| GPU | A1111 (sec) | Forge (sec) |
|---|---|---|
| RTX 3060 12GB | 10 | 7 |
| RTX 4090 24GB | 4 | 3 |
| RX 7900 XTX | 9 | 6 |
| M4 Max | 22 | 16 |
SDXL Lightning + LCM samplers feel especially snappy in Forge.
LoRA / ControlNet / Embeddings {#lora-controlnet}
LoRAs: identical to A1111 — <lora:name:0.8> in prompt or use the LoRA tab.
ControlNet: Forge has built-in ControlNet integration that supersedes the A1111 extension on Forge. Drop ControlNet models in models/ControlNet/. Use via the ControlNet section in txt2img / img2img tab.
Forge's ControlNet supports SDXL, SD 1.5, SD 3.5, and Flux ControlNets natively — A1111 needs separate extensions / patches for each.
Embeddings: identical workflow.
For deep ControlNet workflows: see Automatic1111 Guide and ComfyUI Complete Guide.
NF4 Quantization for 6-8 GB VRAM {#nf4}
NF4 (NormalFloat 4-bit) is bitsandbytes-compatible quantization. Forge supports it natively for Flux:
models/Stable-diffusion/flux1-dev-bnb-nf4-v2.safetensors # 6.5 GB
Quality vs FP8: ~95% — slight loss on fine detail and text rendering, comparable on overall composition. Speed: ~30-50% slower than FP8 due to dequant overhead but uses 50% less VRAM.
For 6 GB / 8 GB cards (RTX 3060 / 4060 / 4060 Ti), NF4 is the only practical Flux Dev option.
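To see why NF4 costs so little quality, here is a dependency-light round-trip sketch in NumPy: absmax-scale a weight block, snap each value to the nearest of the 16 NF4 code values (normal-distribution quantiles, as published for bitsandbytes, rounded here to 4 decimals), then rescale. This shows the numerics only, not the real packed 4-bit kernel:

```python
import numpy as np

# The 16 NF4 code values (quantiles of a standard normal scaled to [-1, 1]),
# hardcoded (and rounded) so the sketch needs no bitsandbytes dependency.
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_roundtrip(weights):
    """Quantize + dequantize a 1-D weight block: absmax-scale to [-1, 1],
    snap to the nearest NF4 level, rescale back."""
    scale = np.abs(weights).max() or 1.0
    normalized = weights / scale
    idx = np.abs(normalized[:, None] - NF4_LEVELS[None, :]).argmin(axis=1)
    return NF4_LEVELS[idx] * scale, idx.astype(np.uint8)

# Neural-net weights are roughly normal, which is exactly what NF4's
# quantile spacing is optimized for:
w = np.random.default_rng(0).normal(0, 0.02, 4096)
w_hat, codes = nf4_roundtrip(w)
print("mean abs error:", np.abs(w - w_hat).mean())
```

Because the levels are packed densely near zero, where most weights live, the average round-trip error stays a small fraction of the typical weight magnitude, which is why composition survives and only fine detail suffers.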
GGUF Quantization for Flux {#gguf}
GGUF Flux quants from city96 / leejet:
flux1-dev-Q8_0.gguf # 13 GB — highest quality
flux1-dev-Q6_K.gguf # 10 GB — balanced
flux1-dev-Q5_K_M.gguf # 8.5 GB
flux1-dev-Q4_K_S.gguf # 7 GB — tightest VRAM
Forge loads them via the standard checkpoint dropdown. Quality at Q8_0 is essentially identical to BF16; Q4_K_S has noticeable loss but fits 8 GB cards.
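The file sizes above reduce to a simple rule of thumb: take the largest quant that fits your card with some headroom for activations and the text encoder. A hypothetical helper (the headroom figure is a rough guess, not a Forge setting):

```python
# Pick the largest Flux GGUF quant that fits a VRAM budget.
# Sizes are the file sizes from the listing above, in GB, best quality first.
QUANTS = [
    ("flux1-dev-Q8_0.gguf", 13.0),
    ("flux1-dev-Q6_K.gguf", 10.0),
    ("flux1-dev-Q5_K_M.gguf", 8.5),
    ("flux1-dev-Q4_K_S.gguf", 7.0),
]

def pick_quant(vram_gb, headroom_gb=1.0):
    """Return the best-quality quant file that fits, or None if none do."""
    for name, size in QUANTS:
        if size + headroom_gb <= vram_gb:
            return name
    return None  # below ~8 GB, consider Flux NF4 instead

print(pick_quant(16))
print(pick_quant(10))
```

With Forge's dynamic offload a slightly oversized quant still runs, just slower, so treat this as a starting point rather than a hard limit.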
Extensions Compatibility {#extensions}
Most A1111 extensions work in Forge. Confirmed working in mid-2026:
- sd-webui-controlnet (Forge has built-in but extension also works)
- adetailer
- sd-dynamic-prompts
- sd-webui-additional-networks
- a1111-sd-webui-tagcomplete
- sd-webui-rembg
- sd-webui-segment-anything
Known issues / Forge-specific patches needed:
- multi-diffusion / tiled diffusion — use the Forge-specific fork
- some script-based extensions that hook deep into the sampler
Install via Extensions tab → Available → Load from. Restart after install.
API Mode {#api}
Identical to A1111:
./webui.sh --api --listen
Endpoints: /sdapi/v1/txt2img, /sdapi/v1/img2img, etc. See A1111 API section.
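A minimal stdlib-only client, assuming a local Forge instance launched with --api on the default port (the prompt and filename are placeholders):

```python
# Minimal client for the A1111-compatible txt2img endpoint.
import base64
import json
from urllib.request import Request, urlopen

def txt2img_payload(prompt, steps=20, cfg_scale=7.0,
                    width=1024, height=1024, seed=-1):
    """Build the JSON body for /sdapi/v1/txt2img (core fields only)."""
    return {"prompt": prompt, "steps": steps, "cfg_scale": cfg_scale,
            "width": width, "height": height, "seed": seed}

def txt2img(prompt, url="http://127.0.0.1:7860", **kwargs):
    """POST a generation request; return the images as PNG bytes."""
    body = json.dumps(txt2img_payload(prompt, **kwargs)).encode()
    req = Request(f"{url}/sdapi/v1/txt2img", data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        result = json.load(resp)
    # The API returns base64-encoded PNGs in result["images"].
    return [base64.b64decode(img) for img in result["images"]]

# Usage (with the server running):
#   images = txt2img("a lighthouse at dusk", steps=20, cfg_scale=1.0)
#   open("out.png", "wb").write(images[0])
```

For Flux checkpoints, pass cfg_scale=1.0 to match the recommended settings above.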
Tuning Recipes {#tuning}
RTX 3060 12 GB
# webui-user.bat / .sh
COMMANDLINE_ARGS=--listen
(Forge auto-detects best settings; explicit flags rarely needed.)
RTX 4090 24 GB
COMMANDLINE_ARGS=--listen --api
Tight VRAM (6 GB)
COMMANDLINE_ARGS=--listen --always-low-vram
Use NF4 Flux for image gen.
Apple M4 Max
COMMANDLINE_ARGS=--listen --no-half-vae
AMD and Mac Setup {#amd-mac}
AMD (ROCm)
Use the lshqqytiger AMD-friendly fork: stable-diffusion-webui-amdgpu-forge. Same install path as Forge but with ROCm-specific patches. Performance on RX 7900 XTX: ~75-85% of equivalent NVIDIA. See AMD ROCm Setup.
Mac (Apple Silicon)
Native Forge works via MPS but is slower than discrete NVIDIA. For best Mac performance use Draw Things (MLX-native, separate app); for tabbed UI compatibility, stick with Forge.
Real Benchmarks {#benchmarks}
| Workflow | A1111 | Forge | ComfyUI |
|---|---|---|---|
| SDXL 1024² (RTX 4090) | 4.0 sec | 3.0 sec | 2.8 sec |
| Flux Dev FP8 1024² (RTX 4090) | 22 sec | 9 sec | 8 sec |
| Flux Schnell 1024² (RTX 4090) | 8 sec | 3 sec | 3 sec |
| SDXL 1024² (RX 7900 XTX) | 9 sec | 6 sec | 6 sec |
| Flux Dev NF4 (RTX 3060 12GB) | OOM | 14 sec | 16 sec |
| SDXL 1024² (M4 Max) | 22 sec | 16 sec | 14 sec |
Forge erases A1111's slowness on Flux while keeping the familiar UI.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Black images | NaN in VAE | --no-half-vae |
| OOM at Flux Dev BF16 | Need 24 GB | Switch to FP8 or NF4 |
| Extension breaks | Forge-incompatible | Look for Forge-specific fork |
| Slow first generation | Models load lazily on first use | Subsequent generations are fast |
| ControlNet has no effect | Wrong base | SDXL ControlNet on SDXL only |
| AMD: hipErrorNoBinaryForGpu | Wrong fork | Use amdgpu-forge fork |
| Mac: incompatible PyTorch | Python ≠ 3.10 | Reinstall Python 3.10 |
FAQ {#faq}
See answers to common SD Forge questions below.
Sources: SD Forge GitHub | reForge GitHub | lllyasviel's blog | Internal benchmarks RTX 3060, 4090, RX 7900 XTX, M4 Max.