
SD Forge Guide (2026): Faster Automatic1111 with Native Flux and SD 3.5 Support

May 1, 2026
22 min read
LocalAimaster Research Team

SD Forge is what Automatic1111 looks like when someone rewrites the backend for memory and speed. lllyasviel (author of ControlNet, Fooocus, and several Stable Diffusion papers) built it as a faster A1111 with native Flux Dev support and dynamic VRAM offloading. The UI is identical and the extensions mostly carry over, but generation is 30-60% faster on the same hardware, and a 6 GB GPU can run Flux NF4 where A1111 would OOM.

This guide covers everything: installation across platforms, the differences from A1111, the reForge community fork, native Flux Dev / Schnell / SD 3.5 support, NF4 / FP8 / GGUF quantization for low-VRAM users, ControlNet integration, extension compatibility, and migration from an existing A1111 setup.

Table of Contents

  1. What Forge Is and Why Use It
  2. Forge vs reForge
  3. Forge vs A1111 vs ComfyUI
  4. Hardware Requirements
  5. Installation: Windows, Linux, Mac
  6. Migrating from A1111
  7. The UNet Patcher and Dynamic Offloading
  8. Flux Dev / Schnell Setup
  9. SD 3.5 Setup
  10. SDXL and SD 1.5 in Forge
  11. LoRA / ControlNet / Embeddings
  12. NF4 Quantization for 6-8 GB VRAM
  13. GGUF Quantization for Flux
  14. Extensions Compatibility
  15. API Mode
  16. Tuning Recipes
  17. AMD and Mac Setup
  18. Real Benchmarks
  19. Troubleshooting
  20. FAQ


What Forge Is and Why Use It {#what-it-is}

Stable Diffusion Forge ("Forge" for short) is a fork of Automatic1111 (A1111) maintained by lllyasviel since early 2024. The user interface is identical to A1111 — same tabs, same scripts, same extension API surface. Under the hood, the model loader, sampler scheduler, and memory manager are rewritten.

Headline benefits:

  • 30-60% faster generation on the same hardware vs A1111
  • Lower VRAM thanks to the UNet patcher (dynamic offload)
  • Native Flux Dev / Schnell / SD 3.5 / SD 3 support
  • Built-in FlashAttention without xformers dependency hell
  • NF4 / FP8 / GGUF Flux for tight VRAM budgets

Project: github.com/lllyasviel/stable-diffusion-webui-forge. License: AGPL-3.0.


Forge vs reForge {#forge-vs-reforge}

Variant                   Maintainer    Status (mid-2026)
Original Forge            lllyasviel    Active, conservative pace
reForge                   Panchovix     Active, faster A1111 feature merges
sd-webui-forge-classic    community     Frozen but stable

Recommendation for most users in 2026: install reForge. It combines the original Forge's performance work with current A1111 features, and performance between the two is essentially identical.


Forge vs A1111 vs ComfyUI {#comparison}

Property                  A1111             Forge / reForge          ComfyUI
UI                        Tabbed            Tabbed (same as A1111)   Node graph
SDXL speed (RTX 4090)     4 sec/image       ~3 sec/image             ~3 sec/image
Flux Dev speed            Limited support   Native, fast             Native, fast
Min VRAM for Flux Dev     24 GB (FP8)       6 GB (NF4)               8 GB (NF4 / GGUF)
Extension compatibility   All A1111         Most A1111               ComfyUI ecosystem
Workflow flexibility      Medium            Medium                   Highest
Learning curve            Easy              Easy (same as A1111)     Steep

For users transitioning from A1111 to Flux: Forge is the no-brainer choice. For brand-new users wanting maximum flexibility: ComfyUI.



Hardware Requirements {#requirements}

GPU VRAM   Models you can run
4 GB       SD 1.5 only
6 GB       SD 1.5, SDXL Lightning, Flux NF4
8 GB       SDXL, SD 3.5 Medium, Flux GGUF Q4
12 GB      SDXL with refiner, SD 3.5 Large, Flux FP8
16 GB      All SDXL, Flux Dev FP8
24 GB      Flux Dev BF16, large LoRA stacks

Forge's memory advantage over A1111 comes from the UNet patcher; see the UNet Patcher and Dynamic Offloading section below.


Installation: Windows, Linux, Mac {#installation}

Windows

# Install Python 3.10.6, git
git clone https://github.com/Panchovix/stable-diffusion-webui-reForge
cd stable-diffusion-webui-reForge
.\webui-user.bat

First run takes 10-20 minutes (PyTorch + dependencies).

Linux

git clone https://github.com/Panchovix/stable-diffusion-webui-reForge
cd stable-diffusion-webui-reForge
./webui.sh

Mac (Apple Silicon)

git clone https://github.com/Panchovix/stable-diffusion-webui-reForge
cd stable-diffusion-webui-reForge
./webui.sh

MPS auto-detected. Performance lags NVIDIA — Draw Things (MLX-native) is faster on Mac.
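
Updating later is the same on all three platforms: pull the changes and relaunch.

git pull
./webui.sh        # or .\webui-user.bat on Windows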


Migrating from A1111 {#migration}

Keep your existing A1111 install. Install Forge to a separate directory. Edit Forge's webui-user.bat / .sh to add:

# webui-user.sh (Linux / Mac). On Windows, put the same flags on one line
# after "set COMMANDLINE_ARGS=" in webui-user.bat.
export COMMANDLINE_ARGS="--ckpt-dir /path/to/A1111/models/Stable-diffusion \
  --lora-dir /path/to/A1111/models/Lora \
  --vae-dir /path/to/A1111/models/VAE \
  --controlnet-dir /path/to/A1111/extensions/sd-webui-controlnet/models \
  --embeddings-dir /path/to/A1111/embeddings"

All your models / LoRAs / VAEs / ControlNet / embeddings appear in Forge unchanged. No conversion needed.

For extensions, install the same ones via Forge's Extensions tab → most "just work."


The UNet Patcher and Dynamic Offloading {#unet-patcher}

The UNet patcher tracks VRAM usage in real time and offloads layers to system RAM only when needed. Compared to A1111's --medvram and --lowvram flags, which apply a fixed, coarse offloading policy regardless of actual memory pressure, Forge's approach is dynamic and per-layer.

Result on a 12 GB GPU running Flux Dev FP8:

  • A1111 with --lowvram: ~25 sec / image, full offload
  • Forge default: ~12 sec / image, partial offload as needed

You do NOT need to set memory flags in Forge — it figures out the right behavior automatically.

For very tight VRAM you can still hint:

COMMANDLINE_ARGS=--always-low-vram          # force low-VRAM mode
COMMANDLINE_ARGS=--cuda-malloc              # async CUDA allocator; can speed up some cards

Flux Dev / Schnell Setup {#flux}

Files needed

models/Stable-diffusion/flux1-dev.safetensors          # 24 GB BF16
                       or flux1-dev-fp8.safetensors    # 12 GB FP8
                       or flux1-dev-Q8_0.gguf          # 13 GB GGUF
                       or flux1-dev-bnb-nf4-v2.safetensors  # 6.5 GB NF4
models/Stable-diffusion/flux1-schnell.safetensors      # 24 GB BF16

models/VAE/ae.safetensors                              # Flux VAE

models/text_encoder/clip_l.safetensors                 # 246 MB
models/text_encoder/t5xxl_fp16.safetensors             # 9.8 GB (or fp8 at 5 GB)
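
If you prefer fetching these from the command line, here is a hedged sketch using the Hugging Face CLI. The repo IDs shown are the usual sources (FLUX.1-dev is gated, so accept the license on its page and run huggingface-cli login first); verify exact filenames on the model pages.

pip install -U "huggingface_hub[cli]"

# Flux Dev weights + VAE (gated repo)
huggingface-cli download black-forest-labs/FLUX.1-dev flux1-dev.safetensors \
    --local-dir models/Stable-diffusion
huggingface-cli download black-forest-labs/FLUX.1-dev ae.safetensors \
    --local-dir models/VAE

# Text encoders (mirrored in comfyanonymous/flux_text_encoders)
huggingface-cli download comfyanonymous/flux_text_encoders \
    clip_l.safetensors t5xxl_fp16.safetensors --local-dir models/text_encoder

# NF4 build for low-VRAM cards
huggingface-cli download lllyasviel/flux1-dev-bnb-nf4 flux1-dev-bnb-nf4-v2.safetensors \
    --local-dir models/Stable-diffusion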

For Flux specifically, Forge adds a "Diffusion in Low Bits" dropdown to the UI that selects the weight precision used at inference (Automatic, NF4, FP8, BF16, FP16). Set it before generating.

Recommended Flux Dev settings:

  • Sampler: Euler
  • Schedule: Simple
  • Steps: 20
  • CFG: 1.0
  • Resolution: 1024x1024 (Flux is trained for 0.5-2 megapixels)

For Flux Schnell: 4 steps, CFG 1.0. ~5x faster than Dev.


SD 3.5 Setup {#sd-35}

Stability AI's SD 3.5 family works in Forge natively:

models/Stable-diffusion/sd3.5_large.safetensors        # 8 GB
                       or sd3.5_medium.safetensors     # 4 GB

You also need the SD 3 text encoders (CLIP-G, CLIP-L, T5-XXL). Forge prompts you on first run.
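
The same download approach works here; a sketch assuming the current layout of Stability's gated repo (verify paths, and note that --local-dir preserves the text_encoders/ subfolder, so move the files up afterwards):

huggingface-cli download stabilityai/stable-diffusion-3.5-large sd3.5_large.safetensors \
    --local-dir models/Stable-diffusion
huggingface-cli download stabilityai/stable-diffusion-3.5-large \
    text_encoders/clip_g.safetensors text_encoders/clip_l.safetensors \
    text_encoders/t5xxl_fp16.safetensors --local-dir models/text_encoder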

Recommended settings:

  • Sampler: Euler
  • Schedule: Simple
  • Steps: 28-40
  • CFG: 4.5

SD 3.5 Large requires ~16 GB VRAM at FP16; the Medium model fits in ~10 GB.


SDXL and SD 1.5 in Forge {#sdxl}

Identical workflow to A1111 — same UI, same parameters, same extensions. Speed advantage on the same hardware:

GPU             A1111 (sec)   Forge (sec)
RTX 3060 12GB   10            7
RTX 4090 24GB   4             3
RX 7900 XTX     9             6
M4 Max          22            16

SDXL Lightning + LCM samplers feel especially snappy in Forge.


LoRA / ControlNet / Embeddings {#lora-controlnet}

LoRAs: identical to A1111 — <lora:name:0.8> in prompt or use the LoRA tab.

ControlNet: Forge ships built-in ControlNet integration that replaces the A1111 extension. Drop ControlNet models in models/ControlNet/ and use the ControlNet section in the txt2img / img2img tabs.

Forge's ControlNet supports SDXL, SD 1.5, SD 3.5, and Flux ControlNets natively — A1111 needs separate extensions / patches for each.
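
To populate models/ControlNet/ from the command line, one hedged example (repo and filename are illustrative; browse lllyasviel/sd_control_collection or your preferred source for the variant you need):

huggingface-cli download lllyasviel/sd_control_collection diffusers_xl_canny_full.safetensors \
    --local-dir models/ControlNet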

Embeddings: identical workflow.

For deep ControlNet workflows: see Automatic1111 Guide and ComfyUI Complete Guide.


NF4 Quantization for 6-8 GB VRAM {#nf4}

NF4 (NormalFloat 4-bit) is bitsandbytes-compatible quantization. Forge supports it natively for Flux:

models/Stable-diffusion/flux1-dev-bnb-nf4-v2.safetensors    # 6.5 GB

Quality vs FP8: ~95% — slight loss on fine detail and text rendering, comparable on overall composition. Speed: ~30-50% slower than FP8 due to dequant overhead but uses 50% less VRAM.

For 6 GB / 8 GB cards (RTX 3060 / 4060 / 4060 Ti), NF4 is the only practical Flux Dev option.


GGUF Quantization for Flux {#gguf}

GGUF Flux quants from city96 / leejet:

flux1-dev-Q8_0.gguf       # 13 GB — highest quality
flux1-dev-Q6_K.gguf       # 10 GB — balanced
flux1-dev-Q5_K_M.gguf     # 8.5 GB
flux1-dev-Q4_K_S.gguf     # 7 GB — tightest VRAM

Forge loads them via the standard checkpoint dropdown. Quality at Q8_0 is essentially identical to BF16; Q4_K_S has noticeable loss but fits 8 GB cards.
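
Each quant is a single file, so downloads are simple; a sketch pulling Q8_0 from city96's repo (verify the filename on the repo page):

huggingface-cli download city96/FLUX.1-dev-gguf flux1-dev-Q8_0.gguf \
    --local-dir models/Stable-diffusion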


Extensions Compatibility {#extensions}

Most A1111 extensions work in Forge. Confirmed working in mid-2026:

  • sd-webui-controlnet (Forge has built-in but extension also works)
  • adetailer
  • sd-dynamic-prompts
  • sd-webui-additional-networks
  • a1111-sd-webui-tagcomplete
  • sd-webui-rembg
  • sd-webui-segment-anything

Known issues / Forge-specific patches needed:

  • multi-diffusion / tiled diffusion — use the Forge-specific fork
  • some script-based extensions that hook deep into the sampler

Install via Extensions tab → Available → Load from. Restart after install.
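
Extensions are ordinary git repos, so cloning into extensions/ is equivalent to using the tab. For example, with adetailer (repo URL as commonly published; verify before cloning):

cd stable-diffusion-webui-reForge/extensions
git clone https://github.com/Bing-su/adetailer
# restart the UI to load it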


API Mode {#api}

Identical to A1111:

./webui.sh --api --listen

Endpoints are the same: /sdapi/v1/txt2img, /sdapi/v1/img2img, etc. See the API section of the A1111 guide.
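
A minimal smoke test with curl, assuming the default port 7860 and jq installed. The payload mirrors the Flux Dev UI recipe from earlier; the response returns images as base64 strings.

curl -s http://127.0.0.1:7860/sdapi/v1/txt2img \
    -H "Content-Type: application/json" \
    -d '{"prompt": "a lighthouse at dusk, photorealistic",
         "steps": 20, "cfg_scale": 1.0,
         "width": 1024, "height": 1024,
         "sampler_name": "Euler"}' \
    | jq -r '.images[0]' | base64 --decode > output.png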


Tuning Recipes {#tuning}

RTX 3060 12 GB

# webui-user.bat / .sh
COMMANDLINE_ARGS=--listen

(Forge auto-detects best settings; explicit flags rarely needed.)

RTX 4090 24 GB

COMMANDLINE_ARGS=--listen --api

Tight VRAM (6 GB)

COMMANDLINE_ARGS=--listen --always-low-vram

Use the NF4 Flux checkpoint for image generation.

Apple M4 Max

COMMANDLINE_ARGS=--listen --no-half-vae

AMD and Mac Setup {#amd-mac}

AMD (ROCm)

Use the lshqqytiger AMD-friendly fork, stable-diffusion-webui-amdgpu-forge. Same install flow as Forge, with ROCm-specific patches. Performance on an RX 7900 XTX is roughly 75-85% of comparable NVIDIA hardware. See AMD ROCm Setup.
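
Assuming the repo lives under lshqqytiger's GitHub account under that name (verify before cloning), installation mirrors the standard flow:

git clone https://github.com/lshqqytiger/stable-diffusion-webui-amdgpu-forge
cd stable-diffusion-webui-amdgpu-forge
./webui.sh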

Mac (Apple Silicon)

Native Forge works via MPS but is slower than discrete NVIDIA. For best Mac performance use Draw Things (MLX-native, separate app); for tabbed UI compatibility, stick with Forge.


Real Benchmarks {#benchmarks}

Workflow                         A1111     Forge     ComfyUI
SDXL 1024² (RTX 4090)            4.0 sec   3.0 sec   2.8 sec
Flux Dev FP8 1024² (RTX 4090)    22 sec    9 sec     8 sec
Flux Schnell 1024² (RTX 4090)    8 sec     3 sec     3 sec
SDXL 1024² (RX 7900 XTX)         9 sec     6 sec     6 sec
Flux Dev NF4 (RTX 3060 12GB)     OOM       14 sec    16 sec
SDXL 1024² (M4 Max)              22 sec    16 sec    14 sec

Forge erases A1111's slowness on Flux while keeping the familiar UI.


Troubleshooting {#troubleshooting}

Symptom                        Cause                 Fix
Black images                   NaN in VAE            Add --no-half-vae
OOM at Flux Dev BF16           Needs 24 GB VRAM      Switch to FP8 or NF4
Extension breaks               Forge-incompatible    Look for a Forge-specific fork
Slow first generation          Lazy model load       Subsequent runs are fast
ControlNet has no effect       Wrong base model      Use an SDXL ControlNet on SDXL only
AMD: hipErrorNoBinaryForGpu    Wrong fork            Use the amdgpu-forge fork
Mac: incompatible PyTorch      Python ≠ 3.10         Reinstall Python 3.10

FAQ {#faq}



Sources: SD Forge GitHub | reForge GitHub | lllyasviel's blog | Internal benchmarks (RTX 3060, RTX 4090, RX 7900 XTX, M4 Max).
