
SD Forge Guide (2026): Faster Automatic1111 with Native Flux and SD 3.5 Support

May 1, 2026
22 min read
LocalAimaster Research Team

SD Forge is what Automatic1111 looks like when someone rewrites the backend for memory and speed. lllyasviel (author of ControlNet, Fooocus, and several Stable Diffusion papers) built it as a faster A1111 with native Flux Dev support and dynamic VRAM offloading. The UI is identical and the extensions mostly carry over, but generation is 30-60% faster on the same hardware, and a 6 GB GPU can run Flux NF4 where A1111 would OOM.

This guide covers everything: installation across platforms, the differences from A1111, the reForge community fork, native Flux Dev / Schnell / SD 3.5 support, NF4 / FP8 / GGUF quantization for low-VRAM users, ControlNet integration, extension compatibility, and migration from an existing A1111 setup.

Table of Contents

  1. What Forge Is and Why Use It
  2. Forge vs reForge
  3. Forge vs A1111 vs ComfyUI
  4. Hardware Requirements
  5. Installation: Windows, Linux, Mac
  6. Migrating from A1111
  7. The UNet Patcher and Dynamic Offloading
  8. Flux Dev / Schnell Setup
  9. SD 3.5 Setup
  10. SDXL and SD 1.5 in Forge
  11. LoRA / ControlNet / Embeddings
  12. NF4 Quantization for 6-8 GB VRAM
  13. GGUF Quantization for Flux
  14. Extensions Compatibility
  15. API Mode
  16. Tuning Recipes
  17. AMD and Mac Setup
  18. Real Benchmarks
  19. Troubleshooting
  20. FAQ


What Forge Is and Why Use It {#what-it-is}

Stable Diffusion Forge ("Forge" for short) is a fork of Automatic1111 (A1111) maintained by lllyasviel since early 2024. The user interface is identical to A1111 — same tabs, same scripts, same extension API surface. Under the hood, the model loader, sampler scheduler, and memory manager are rewritten.

Headline benefits:

  • 30-60% faster generation on the same hardware vs A1111
  • Lower VRAM thanks to the UNet patcher (dynamic offload)
  • Native Flux Dev / Schnell / SD 3.5 / SD 3 support
  • Built-in FlashAttention without xformers dependency hell
  • NF4 / FP8 / GGUF Flux for tight VRAM budgets

Project: github.com/lllyasviel/stable-diffusion-webui-forge. License: AGPL-3.0.


Forge vs reForge {#forge-vs-reforge}

Variant                   Maintainer    Status (mid-2026)
Original Forge            lllyasviel    Active, conservative pace
reForge                   Panchovix     Active, faster A1111 feature merges
sd-webui-forge-classic    community     Frozen but stable

Recommendation for most users in 2026: install reForge. It combines the original Forge's performance work with current A1111 features, and performance between the two is essentially identical.


Forge vs A1111 vs ComfyUI {#comparison}

Property                  A1111             Forge / reForge          ComfyUI
UI                        Tabbed            Tabbed (same as A1111)   Node graph
SDXL speed (RTX 4090)     4 sec/image       ~3 sec/image             ~3 sec/image
Flux Dev speed            Limited support   Native, fast             Native, fast
Min VRAM for Flux Dev     24 GB (FP8)       6 GB (NF4)               8 GB (NF4 / GGUF)
Extension compatibility   All A1111         Most A1111               ComfyUI ecosystem
Workflow flexibility      Medium            Medium                   Highest
Learning curve            Easy              Easy (same as A1111)     Steep

For users transitioning from A1111 to Flux: Forge is the no-brainer choice. For brand-new users wanting maximum flexibility: ComfyUI.



Hardware Requirements {#requirements}

GPU VRAM   Models you can run
4 GB       SD 1.5 only
6 GB       SD 1.5, SDXL Lightning, Flux NF4
8 GB       SDXL, SD 3.5 Medium, Flux GGUF Q4
12 GB      SDXL with refiner, SD 3.5 Large, Flux FP8
16 GB      All SDXL, Flux Dev FP8
24 GB      Flux Dev BF16, large LoRA stacks

Forge's memory advantage over A1111 comes from the UNet patcher; see the UNet Patcher and Dynamic Offloading section below.


Installation: Windows, Linux, Mac {#installation}

Windows

# Install Python 3.10.6, git
git clone https://github.com/Panchovix/stable-diffusion-webui-reForge
cd stable-diffusion-webui-reForge
.\webui-user.bat

First run takes 10-20 minutes (PyTorch + dependencies).

Linux

git clone https://github.com/Panchovix/stable-diffusion-webui-reForge
cd stable-diffusion-webui-reForge
./webui.sh

Mac (Apple Silicon)

git clone https://github.com/Panchovix/stable-diffusion-webui-reForge
cd stable-diffusion-webui-reForge
./webui.sh

MPS auto-detected. Performance lags NVIDIA — Draw Things (MLX-native) is faster on Mac.
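
Updating later is the same on all three platforms: pull the changes and relaunch.

git pull
./webui.sh        # or .\webui-user.bat on Windows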


Migrating from A1111 {#migration}

Keep your existing A1111 install. Install Forge to a separate directory. Edit Forge's webui-user.bat / .sh to add:

# webui-user.sh (Linux / Mac). On Windows, put the same flags on one line
# after "set COMMANDLINE_ARGS=" in webui-user.bat.
export COMMANDLINE_ARGS="--ckpt-dir /path/to/A1111/models/Stable-diffusion \
  --lora-dir /path/to/A1111/models/Lora \
  --vae-dir /path/to/A1111/models/VAE \
  --controlnet-dir /path/to/A1111/extensions/sd-webui-controlnet/models \
  --embeddings-dir /path/to/A1111/embeddings"

All your models / LoRAs / VAEs / ControlNet / embeddings appear in Forge unchanged. No conversion needed.

For extensions, install the same ones via Forge's Extensions tab → most "just work."


The UNet Patcher and Dynamic Offloading {#unet-patcher}

The UNet patcher tracks VRAM usage in real time and offloads layers to system RAM only when needed. Compared to A1111's --medvram and --lowvram flags, which apply a fixed, coarse offloading policy regardless of actual memory pressure, Forge's approach is dynamic and per-layer.

Result on a 12 GB GPU running Flux Dev FP8:

  • A1111 with --lowvram: ~25 sec / image, full offload
  • Forge default: ~12 sec / image, partial offload as needed

You do NOT need to set memory flags in Forge — it figures out the right behavior automatically.

For very tight VRAM you can still hint:

COMMANDLINE_ARGS=--always-low-vram          # force low-VRAM mode
COMMANDLINE_ARGS=--cuda-malloc              # async CUDA allocator; can speed up some cards

Flux Dev / Schnell Setup {#flux}

Files needed

models/Stable-diffusion/flux1-dev.safetensors          # 24 GB BF16
                       or flux1-dev-fp8.safetensors    # 12 GB FP8
                       or flux1-dev-Q8_0.gguf          # 13 GB GGUF
                       or flux1-dev-bnb-nf4-v2.safetensors  # 6.5 GB NF4
models/Stable-diffusion/flux1-schnell.safetensors      # 24 GB BF16

models/VAE/ae.safetensors                              # Flux VAE

models/text_encoder/clip_l.safetensors                 # 246 MB
models/text_encoder/t5xxl_fp16.safetensors             # 9.8 GB (or fp8 at 5 GB)
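
If you prefer fetching these from the command line, here is a hedged sketch using the Hugging Face CLI. The repo IDs shown are the usual sources (FLUX.1-dev is gated, so accept the license on its page and run huggingface-cli login first); verify exact filenames on the model pages.

pip install -U "huggingface_hub[cli]"

# Flux Dev weights + VAE (gated repo)
huggingface-cli download black-forest-labs/FLUX.1-dev flux1-dev.safetensors \
    --local-dir models/Stable-diffusion
huggingface-cli download black-forest-labs/FLUX.1-dev ae.safetensors \
    --local-dir models/VAE

# Text encoders (mirrored in comfyanonymous/flux_text_encoders)
huggingface-cli download comfyanonymous/flux_text_encoders \
    clip_l.safetensors t5xxl_fp16.safetensors --local-dir models/text_encoder

# NF4 build for low-VRAM cards
huggingface-cli download lllyasviel/flux1-dev-bnb-nf4 flux1-dev-bnb-nf4-v2.safetensors \
    --local-dir models/Stable-diffusion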

For Flux specifically, Forge adds a "Diffusion in Low Bits" dropdown to the UI that selects the weight precision used at inference (Automatic, NF4, FP8, BF16, FP16). Set it before generating.

Recommended Flux Dev settings:

  • Sampler: Euler
  • Schedule: Simple
  • Steps: 20
  • CFG: 1.0
  • Resolution: 1024x1024 (Flux is trained for 0.5-2 megapixels)

For Flux Schnell: 4 steps, CFG 1.0. ~5x faster than Dev.


SD 3.5 Setup {#sd-35}

Stability AI's SD 3.5 family works in Forge natively:

models/Stable-diffusion/sd3.5_large.safetensors        # 8 GB
                       or sd3.5_medium.safetensors     # 4 GB

You also need the SD 3 text encoders (CLIP-G, CLIP-L, T5-XXL). Forge prompts you on first run.
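
The same download approach works here; a sketch assuming the current layout of Stability's gated repo (verify paths, and note that --local-dir preserves the text_encoders/ subfolder, so move the files up afterwards):

huggingface-cli download stabilityai/stable-diffusion-3.5-large sd3.5_large.safetensors \
    --local-dir models/Stable-diffusion
huggingface-cli download stabilityai/stable-diffusion-3.5-large \
    text_encoders/clip_g.safetensors text_encoders/clip_l.safetensors \
    text_encoders/t5xxl_fp16.safetensors --local-dir models/text_encoder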

Recommended settings:

  • Sampler: Euler
  • Schedule: Simple
  • Steps: 28-40
  • CFG: 4.5

SD 3.5 Large requires ~16 GB VRAM at FP16; the Medium model fits in ~10 GB.


SDXL and SD 1.5 in Forge {#sdxl}

Identical workflow to A1111 — same UI, same parameters, same extensions. Speed advantage on the same hardware:

GPU             A1111 (sec)   Forge (sec)
RTX 3060 12GB   10            7
RTX 4090 24GB   4             3
RX 7900 XTX     9             6
M4 Max          22            16

SDXL Lightning + LCM samplers feel especially snappy in Forge.


LoRA / ControlNet / Embeddings {#lora-controlnet}

LoRAs: identical to A1111 — <lora:name:0.8> in prompt or use the LoRA tab.

ControlNet: Forge ships built-in ControlNet integration that replaces the A1111 extension. Drop ControlNet models in models/ControlNet/ and use the ControlNet section in the txt2img / img2img tabs.

Forge's ControlNet supports SDXL, SD 1.5, SD 3.5, and Flux ControlNets natively — A1111 needs separate extensions / patches for each.
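
To populate models/ControlNet/ from the command line, one hedged example (repo and filename are illustrative; browse lllyasviel/sd_control_collection or your preferred source for the variant you need):

huggingface-cli download lllyasviel/sd_control_collection diffusers_xl_canny_full.safetensors \
    --local-dir models/ControlNet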

Embeddings: identical workflow.

For deep ControlNet workflows: see Automatic1111 Guide and ComfyUI Complete Guide.


NF4 Quantization for 6-8 GB VRAM {#nf4}

NF4 (NormalFloat 4-bit) is bitsandbytes-compatible quantization. Forge supports it natively for Flux:

models/Stable-diffusion/flux1-dev-bnb-nf4-v2.safetensors    # 6.5 GB

Quality vs FP8: ~95% — slight loss on fine detail and text rendering, comparable on overall composition. Speed: ~30-50% slower than FP8 due to dequant overhead but uses 50% less VRAM.

For 6 GB / 8 GB cards (RTX 3060 / 4060 / 4060 Ti), NF4 is the only practical Flux Dev option.


GGUF Quantization for Flux {#gguf}

GGUF Flux quants from city96 / leejet:

flux1-dev-Q8_0.gguf       # 13 GB — highest quality
flux1-dev-Q6_K.gguf       # 10 GB — balanced
flux1-dev-Q5_K_M.gguf     # 8.5 GB
flux1-dev-Q4_K_S.gguf     # 7 GB — tightest VRAM

Forge loads them via the standard checkpoint dropdown. Quality at Q8_0 is essentially identical to BF16; Q4_K_S has noticeable loss but fits 8 GB cards.
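
Each quant is a single file, so downloads are simple; a sketch pulling Q8_0 from city96's repo (verify the filename on the repo page):

huggingface-cli download city96/FLUX.1-dev-gguf flux1-dev-Q8_0.gguf \
    --local-dir models/Stable-diffusion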


Extensions Compatibility {#extensions}

Most A1111 extensions work in Forge. Confirmed working in mid-2026:

  • sd-webui-controlnet (Forge has built-in but extension also works)
  • adetailer
  • sd-dynamic-prompts
  • sd-webui-additional-networks
  • a1111-sd-webui-tagcomplete
  • sd-webui-rembg
  • sd-webui-segment-anything

Known issues / Forge-specific patches needed:

  • multi-diffusion / tiled diffusion — use the Forge-specific fork
  • some script-based extensions that hook deep into the sampler

Install via Extensions tab → Available → Load from. Restart after install.
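
Extensions are ordinary git repos, so cloning into extensions/ is equivalent to using the tab. For example, with adetailer (repo URL as commonly published; verify before cloning):

cd stable-diffusion-webui-reForge/extensions
git clone https://github.com/Bing-su/adetailer
# restart the UI to load it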


API Mode {#api}

Identical to A1111:

./webui.sh --api --listen

Endpoints are the same: /sdapi/v1/txt2img, /sdapi/v1/img2img, etc. See the API section of the A1111 guide.
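
A minimal smoke test with curl, assuming the default port 7860 and jq installed. The payload mirrors the Flux Dev UI recipe from earlier; the response returns images as base64 strings.

curl -s http://127.0.0.1:7860/sdapi/v1/txt2img \
    -H "Content-Type: application/json" \
    -d '{"prompt": "a lighthouse at dusk, photorealistic",
         "steps": 20, "cfg_scale": 1.0,
         "width": 1024, "height": 1024,
         "sampler_name": "Euler"}' \
    | jq -r '.images[0]' | base64 --decode > output.png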


Tuning Recipes {#tuning}

RTX 3060 12 GB

# webui-user.bat / .sh
COMMANDLINE_ARGS=--listen

(Forge auto-detects best settings; explicit flags rarely needed.)

RTX 4090 24 GB

COMMANDLINE_ARGS=--listen --api

Tight VRAM (6 GB)

COMMANDLINE_ARGS=--listen --always-low-vram

Use the NF4 Flux checkpoint for image generation.

Apple M4 Max

COMMANDLINE_ARGS=--listen --no-half-vae

AMD and Mac Setup {#amd-mac}

AMD (ROCm)

Use the lshqqytiger AMD-friendly fork, stable-diffusion-webui-amdgpu-forge. Same install flow as Forge, with ROCm-specific patches. Performance on an RX 7900 XTX is roughly 75-85% of comparable NVIDIA hardware. See AMD ROCm Setup.
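
Assuming the repo lives under lshqqytiger's GitHub account under that name (verify before cloning), installation mirrors the standard flow:

git clone https://github.com/lshqqytiger/stable-diffusion-webui-amdgpu-forge
cd stable-diffusion-webui-amdgpu-forge
./webui.sh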

Mac (Apple Silicon)

Native Forge works via MPS but is slower than discrete NVIDIA. For best Mac performance use Draw Things (MLX-native, separate app); for tabbed UI compatibility, stick with Forge.


Real Benchmarks {#benchmarks}

Workflow                         A1111     Forge     ComfyUI
SDXL 1024² (RTX 4090)            4.0 sec   3.0 sec   2.8 sec
Flux Dev FP8 1024² (RTX 4090)    22 sec    9 sec     8 sec
Flux Schnell 1024² (RTX 4090)    8 sec     3 sec     3 sec
SDXL 1024² (RX 7900 XTX)         9 sec     6 sec     6 sec
Flux Dev NF4 (RTX 3060 12GB)     OOM       14 sec    16 sec
SDXL 1024² (M4 Max)              22 sec    16 sec    14 sec

Forge erases A1111's slowness on Flux while keeping the familiar UI.


Troubleshooting {#troubleshooting}

Symptom                        Cause                 Fix
Black images                   NaN in VAE            Add --no-half-vae
OOM at Flux Dev BF16           Needs 24 GB VRAM      Switch to FP8 or NF4
Extension breaks               Forge-incompatible    Look for a Forge-specific fork
Slow first generation          Lazy model load       Subsequent runs are fast
ControlNet has no effect       Wrong base model      Use an SDXL ControlNet on SDXL only
AMD: hipErrorNoBinaryForGpu    Wrong fork            Use the amdgpu-forge fork
Mac: incompatible PyTorch      Python ≠ 3.10         Reinstall Python 3.10

FAQ {#faq}



Sources: SD Forge GitHub | reForge GitHub | lllyasviel's blog | Internal benchmarks (RTX 3060, RTX 4090, RX 7900 XTX, M4 Max).
