ComfyUI Complete Guide (2026): Install, Workflows, ControlNet, Flux, SDXL
ComfyUI is the most powerful frontend for local image and video generation in 2026 — a node-based interface that exposes every internal step of diffusion models, supports new architectures on day one, and ships with the fastest sampling implementations. This guide covers everything: installation across NVIDIA / AMD / Mac, the node graph mental model, ControlNet, IPAdapter, regional prompting, LoRA stacks, Flux / SDXL / SD 3.5 workflows, video generation (Wan, HunyuanVideo, Mochi), and serious tuning for 24 GB and below.
Whether you are coming from Automatic1111, Forge, or fresh to local image gen, this is the reference.
Table of Contents
- What ComfyUI Is and Why It Wins
- Hardware Requirements
- Installation: Windows, Linux, macOS
- Folder Layout & Where Models Go
- ComfyUI Manager — Mandatory First Install
- The Node Graph Mental Model
- Your First Workflow: SDXL Text-to-Image
- Models in 2026: Flux, SDXL, SD 3.5, Pony, Illustrious
- LoRAs and Embeddings
- ControlNet — Composition, Pose, Depth, Canny
- IPAdapter — Image Prompts and Style Transfer
- Inpainting, Outpainting, and Upscaling
- Regional Prompting and Conditioning Combine
- Flux: Dev, Schnell, GGUF, Quantized
- Video Generation: Wan 2.2, HunyuanVideo, Mochi
- API Mode and Programmatic Use
- VRAM Optimization Tricks
- Speed Tuning: Sage Attention, Triton, Compile
- Common Custom Node Packs Worth Installing
- Troubleshooting
What ComfyUI Is and Why It Wins {#what-is}
Diffusion image generation is a pipeline:
Prompt → Text Encoder → Conditioning
↓
Empty Latent → Sampler ← Model (UNet/DiT)
↓
VAE Decode → Image
A1111 / Forge / Fooocus hide this pipeline behind a tabbed UI. ComfyUI exposes it as a graph: nodes for "Load Checkpoint," "CLIP Text Encode," "KSampler," "VAE Decode," "Save Image," with explicit data flow between them. You can branch, merge, swap, and chain nodes arbitrarily.
This is why ComfyUI is the fastest to support new models: when Flux launched, supporting it was a matter of writing a new "Load Diffusion Model" node and a new sampler — no UI surgery required. Same for SD 3.5, Wan 2.2, HunyuanVideo, and every model since 2023.
The trade-off: more upfront learning. But every workflow is a JSON file you can load with one drag-and-drop. The community ships pre-built workflows for almost every common task.
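Because workflows are plain JSON, you can inspect or diff one outside the UI. A minimal sketch, assuming a file exported via "Save (API Format)":
import json
# API-format workflows are a flat dict: node id -> {class_type, inputs}.
with open("workflow_api.json") as f:
    workflow = json.load(f)
for node_id, node in workflow.items():
    print(node_id, node["class_type"])
    # Links between nodes are encoded as [source_node_id, output_index].
    for name, value in node["inputs"].items():
        if isinstance(value, list):
            print(f"  {name} <- node {value[0]} output {value[1]}")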
Hardware Requirements {#requirements}
| Tier | GPU | Models You Can Run |
|---|---|---|
| Minimum | 6 GB VRAM | SD 1.5 only |
| Entry | 8 GB VRAM (RTX 3060 8GB) | SDXL Q8, SD 1.5 |
| Recommended | 12 GB (RTX 3060 12GB / 4070) | SDXL FP16, Flux Schnell GGUF Q4 |
| Sweet spot | 16 GB (RTX 4080 / 5070 Ti) | Flux Dev FP8, SD 3.5 Large |
| High-end | 24 GB (RTX 3090 / 4090) | Flux Dev BF16, Wan 2.2, HunyuanVideo Q4 |
| Top | 32+ GB (RTX 5090 / RTX 6000 Ada) | HunyuanVideo BF16, full Mochi |
RAM: at least 2x your largest model file in system RAM for offload buffers. 32 GB is the realistic minimum, 64 GB recommended for video.
Disk: Flux Dev = 24 GB, SDXL = 7 GB, SD 1.5 = 4 GB, plus VAEs (300 MB each), CLIP (1-5 GB), LoRAs (10-500 MB each), ControlNet (1.5 GB each). Plan for 200-500 GB on NVMe.
AMD: RX 7900 XTX works via ROCm at 70-85% of equivalent NVIDIA speed. See AMD ROCm Setup.
Apple Silicon: M2 or newer via MPS, but 3-5x slower than NVIDIA. Use MLX-based alternatives like Draw Things for better performance.
Installation: Windows, Linux, macOS {#installation}
Windows (portable — recommended)
- Download the latest ComfyUI_windows_portable.7z from github.com/comfyanonymous/ComfyUI/releases.
- Extract with 7-Zip to e.g. D:\ComfyUI\.
- Run run_nvidia_gpu.bat (NVIDIA) or run_cpu.bat (no GPU).
- Your browser opens to http://127.0.0.1:8188.
Linux / Mac (git)
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3.11 -m venv venv
source venv/bin/activate
# NVIDIA CUDA 12.4
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
# AMD ROCm 6.2
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.2
# Mac MPS
pip install torch torchvision
# ComfyUI dependencies
pip install -r requirements.txt
# Run
python main.py --listen 0.0.0.0
Docker
docker run --gpus all \
-p 8188:8188 \
-v $(pwd)/models:/app/models \
-v $(pwd)/output:/app/output \
-v $(pwd)/workflows:/app/workflows \
yanwk/comfyui-boot:latest
yanwk/comfyui-boot includes Manager, common custom nodes, and starter models.
Useful launch flags
| Flag | Purpose |
|---|---|
| --listen 0.0.0.0 | Accept connections from LAN |
| --port 8188 | Change port |
| --lowvram | Aggressive memory offload (12 GB and below) |
| --novram | CPU offload everything (only for emergencies) |
| --use-pytorch-cross-attention | Force PyTorch SDPA (most stable) |
| --use-sage-attention | Use Sage Attention (faster, see Speed Tuning) |
| --fast | Enable FP8 ops on Ada/Hopper/Blackwell |
| --cache-classic | Old caching behavior (workaround for some custom nodes) |
| --preview-method auto | Show in-progress previews during sampling |
Folder Layout & Where Models Go {#folder-layout}
ComfyUI/
├── models/
│ ├── checkpoints/ # SD 1.5, SDXL, SD 3.5, Pony, Illustrious .safetensors
│ ├── unet/ # Flux UNet/DiT files
│ ├── diffusion_models/ # Newer alias for unet/
│ ├── clip/ # T5 / CLIP-L / CLIP-G text encoders
│ ├── vae/ # VAE decoders
│ ├── loras/ # LoRA files
│ ├── controlnet/ # ControlNet models
│ ├── embeddings/ # Textual inversion embeddings
│ ├── ipadapter/ # IPAdapter models
│ ├── upscale_models/ # 4x-UltraSharp, RealESRGAN, etc.
│ └── animatediff_models/
├── custom_nodes/ # Third-party node packs
├── workflows/ # Saved JSON workflows
├── input/ # Images for img2img, ControlNet
└── output/ # Generated images
To share models between A1111/Forge and ComfyUI without duplicating files:
# Edit ComfyUI/extra_model_paths.yaml
a111:
    base_path: /path/to/stable-diffusion-webui/
    checkpoints: models/Stable-diffusion
    loras: models/Lora
    controlnet: models/ControlNet
    upscale_models: models/ESRGAN
    embeddings: embeddings
    vae: models/VAE
ComfyUI Manager — Mandatory First Install {#manager}
cd ComfyUI/custom_nodes
git clone https://github.com/ltdrdata/ComfyUI-Manager
Restart ComfyUI. The "Manager" button appears on the right sidebar.
What it does:
- Install Missing Custom Nodes — when you load someone else's workflow.
- Install Models — auto-downloads missing checkpoints/LoRAs/ControlNets.
- Update All — keeps ComfyUI and all custom nodes current.
- Snapshot / Restore — versioned backups before risky updates.
- Disable / Uninstall — surgically remove a custom node.
Without Manager, every missing node is a manual git clone. Install it first; do not skip.
The Node Graph Mental Model {#node-graph}
A workflow is a directed graph. Each node:
- Has inputs (connectors on the left)
- Has outputs (connectors on the right)
- Performs a function (load model, encode text, sample, decode VAE)
Connections carry typed values:
- MODEL — diffusion model
- CLIP — text encoder
- VAE — image encoder/decoder
- CONDITIONING — encoded prompt
- LATENT — latent image (compressed representation)
- IMAGE — pixel image
- MASK — binary mask
- CONTROL_NET — ControlNet model
- STRING, INT, FLOAT — primitives
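These types come from the node definitions themselves. A minimal sketch of a custom node using ComfyUI's node contract (the class, its behavior, and the filename are invented for illustration; INPUT_TYPES, RETURN_TYPES, FUNCTION, and NODE_CLASS_MAPPINGS are the real hooks):
# custom_nodes/blend_latents.py (hypothetical)
class BlendLatents:
    @classmethod
    def INPUT_TYPES(cls):
        # Keys use the same type names as the connector list above.
        return {
            "required": {
                "latent_a": ("LATENT",),
                "latent_b": ("LATENT",),
                "ratio": ("FLOAT", {"default": 0.5, "min": 0.0, "max": 1.0}),
            }
        }
    RETURN_TYPES = ("LATENT",)
    FUNCTION = "blend"
    CATEGORY = "latent"
    def blend(self, latent_a, latent_b, ratio):
        # LATENT values are dicts holding a "samples" tensor.
        samples = latent_a["samples"] * (1 - ratio) + latent_b["samples"] * ratio
        return ({"samples": samples},)
NODE_CLASS_MAPPINGS = {"BlendLatents": BlendLatents}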
The default workflow looks like:
[Load Checkpoint] → MODEL ─┐
→ CLIP │
→ VAE ───┼──┐
│ │
[CLIP Text Encode pos] ←──┘ │
↓ CONDITIONING │
[KSampler] ←──────────────────┘
↑ LATENT (from Empty Latent Image)
↓ LATENT (after sampling)
[VAE Decode] → IMAGE
↓
[Save Image]
Master this and you can build anything. Common patterns:
- Two samplers in series — base + refiner (SDXL).
- Conditioning combine — merge two prompts with weights.
- ControlNet apply — modulate conditioning with control image.
- Latent composite — blend two latents before decoding.
Your First Workflow: SDXL Text-to-Image {#first-workflow}
- Download sd_xl_base_1.0.safetensors from Hugging Face → put in models/checkpoints/.
- Open ComfyUI → Load Default workflow (right sidebar).
- In Load Checkpoint, select sd_xl_base_1.0.safetensors.
- In Empty Latent Image, set width/height to 1024×1024.
- In CLIP Text Encode (Positive), write your prompt, e.g. "cinematic photo of a samurai standing in a misty forest, ultra-detailed, 35mm film".
- In CLIP Text Encode (Negative), write: "blurry, deformed, extra fingers, low quality, bad anatomy".
- In KSampler, set steps=25, cfg=7.0, sampler=dpmpp_2m, scheduler=karras.
- Click Queue Prompt.
Expected time on RTX 4090: ~3-5 seconds per 1024² image.
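Saved via "Save (API Format)", those settings are plain JSON fields. A fragment for reference (node IDs below match the default workflow but may differ in yours):
# KSampler node from the default workflow, API format.
ksampler_node = {
    "class_type": "KSampler",
    "inputs": {
        "seed": 42,
        "steps": 25,
        "cfg": 7.0,
        "sampler_name": "dpmpp_2m",
        "scheduler": "karras",
        "denoise": 1.0,
        "model": ["4", 0],         # MODEL from Load Checkpoint (node 4)
        "positive": ["6", 0],      # positive CLIP Text Encode
        "negative": ["7", 0],      # negative CLIP Text Encode
        "latent_image": ["5", 0],  # Empty Latent Image
    },
}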
Models in 2026: Flux, SDXL, SD 3.5, Pony, Illustrious {#models}
| Model | Params | License | VRAM (BF16) | Strengths |
|---|---|---|---|---|
| Flux Dev | 12B | Non-commercial | ~24 GB | Best general quality, prompt adherence |
| Flux Schnell | 12B | Apache 2.0 | ~24 GB | 4-step distilled — fastest top-tier |
| SD 3.5 Large | 8B | Stability AI Community | ~16 GB | Permissive license, strong prompts |
| SD 3.5 Medium | 2.5B | Stability AI Community | ~10 GB | Lower VRAM, decent quality |
| SDXL 1.0 / Lightning | 3.5B | OpenRAIL | ~10 GB | Largest LoRA ecosystem |
| Pony Diffusion v6 XL | 3.5B | Fair AI | ~10 GB | Anime / character / NSFW |
| Illustrious XL | 3.5B | Fair AI | ~10 GB | Anime, cleaner than Pony |
| SD 1.5 | 0.9B | OpenRAIL | ~4 GB | Legacy, fast iteration |
Flux GGUF / FP8 quantized
Flux Dev in BF16 needs ~24 GB. To run on 12-16 GB:
- Flux Dev FP8 (flux1-dev-fp8.safetensors) — ~12 GB, near-identical quality. Use the --fast flag.
- Flux Dev GGUF Q8_0 — ~13 GB.
- Flux Dev GGUF Q4_K_S — ~7 GB.
For the GGUF variants, use the ComfyUI-GGUF custom node by city96:
[UnetLoader (GGUF)] → MODEL
[DualCLIPLoader (GGUF)] → CLIP # T5-XXL + CLIP-L
[Load VAE] → VAE
LoRAs and Embeddings {#loras}
LoRA chains
[Load Checkpoint] → MODEL ──────┐
→ CLIP ───────┤
↓
[Load LoRA #1]
↓
[Load LoRA #2]
↓
[Load LoRA #3]
↓
MODEL → KSampler
CLIP → CLIP Text Encode
Each LoRA node has strength_model and strength_clip (0.0-1.5 typical, 1.0 default). Stack as many as you want — but >3 strong LoRAs usually conflict.
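In a saved API-format workflow, the chain is just each LoraLoader taking MODEL/CLIP from the previous node. A sketch (node IDs and LoRA filenames are hypothetical):
lora_chain = {
    "10": {
        "class_type": "LoraLoader",
        "inputs": {
            "lora_name": "detail_tweaker_xl.safetensors",  # hypothetical
            "strength_model": 0.8,
            "strength_clip": 0.8,
            "model": ["4", 0],  # MODEL from Load Checkpoint
            "clip": ["4", 1],   # CLIP from Load Checkpoint
        },
    },
    "11": {
        "class_type": "LoraLoader",
        "inputs": {
            "lora_name": "film_grain_xl.safetensors",  # hypothetical
            "strength_model": 0.6,
            "strength_clip": 0.6,
            "model": ["10", 0],  # chained from LoRA #1
            "clip": ["10", 1],
        },
    },
}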
Embeddings (textual inversion)
Reference embeddings in your prompt with embedding:filename (without extension). For example, keep the positive prompt as usual ("masterpiece, beautiful landscape") and add embedding:negative_easy to the negative prompt.
Place .pt / .safetensors files in models/embeddings/.
Best LoRA sources in 2026
- Civitai — largest collection, NSFW filter optional.
- Hugging Face — official model authors.
- Tensor.Art — curated workflows.
Always check the LoRA's recommended trigger words and base model — an SDXL LoRA does not work on Flux.
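You can often check a LoRA's base model without loading it: trainers built on the kohya scripts embed ss_* metadata in the safetensors header. A sketch, assuming the safetensors package (keys vary by trainer and may be absent):
from safetensors import safe_open  # pip install safetensors
def lora_metadata(path: str) -> dict:
    # The safetensors header can carry free-form string metadata.
    with safe_open(path, framework="pt") as f:
        return f.metadata() or {}
meta = lora_metadata("models/loras/my_lora.safetensors")  # hypothetical file
print(meta.get("ss_base_model_version"))  # e.g. "sdxl_base_v1-0" for SDXL LoRAs
print(meta.get("ss_network_dim"))         # LoRA rank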
ControlNet — Composition, Pose, Depth, Canny {#controlnet}
ControlNet conditions generation on a structural input (pose, depth map, edges, etc.).
Pattern
[Load Image] → IMAGE → [Canny Preprocessor] → IMAGE
↓
[Load ControlNet (canny)] → CONTROL_NET ─────┐ │
↓ ↓
[CLIP Text Encode pos] → CONDITIONING → [Apply ControlNet] → CONDITIONING
↓
[KSampler]
Common preprocessors
| Type | Use Case |
|---|---|
| Canny | Preserve edges from reference image |
| Depth (MiDaS, Marigold) | Match 3D structure |
| OpenPose | Match human pose |
| DWPose | Higher-quality OpenPose alternative |
| LineArt | Line drawings, anime |
| Scribble | Rough sketches → finished images |
| Tile | Upscale-friendly preservation |
| Reference | Style match without LoRA |
| InstantID / IP-Composition | Face / composition transfer |
SDXL vs Flux ControlNet
SDXL has the most mature ControlNet ecosystem (xinsir, kohya, lllyasviel). Flux ControlNets are catching up — InstantX, Shakker-Labs, and Black Forest Labs ship Flux ControlNets but coverage is narrower.
IPAdapter — Image Prompts and Style Transfer {#ipadapter}
IPAdapter conditions generation on a reference image (instead of a text prompt). Best for style transfer, character consistency, and "make it look like this" workflows.
[Load Image (reference)] → IMAGE
↓
[IPAdapter Unified Loader] → MODEL, IPADAPTER
↓
[IPAdapter Advanced] ← MODEL ←─────┘
↓ MODEL
[KSampler]
Use the ComfyUI_IPAdapter_plus custom node by cubiq.
IPAdapter strength
| Strength | Effect |
|---|---|
| 0.3 | Subtle style hint |
| 0.6 | Clear style influence |
| 0.9-1.0 | Strong reference dominance |
| 1.2+ | Reference overrides prompt |
IPAdapter FaceID
For consistent character across generations: ip-adapter-faceid-portrait_sdxl.bin + face embedding extracted with InsightFace. One reference image → consistent character in any pose / scene / outfit.
Inpainting, Outpainting, and Upscaling {#inpainting}
Inpainting
[Load Image] → IMAGE
[Load Image (mask)] → IMAGE
[Image to Mask] → MASK
[VAE Encode (Inpaint)] → LATENT (with masked region noised)
[KSampler] (with denoise=0.8) → LATENT
[VAE Decode] → IMAGE
Use the dedicated inpainting checkpoint (e.g. sd_xl_base_1.0_inpainting_0.1.safetensors) for best results. Set sampler denoise to 0.7-0.95.
Outpainting
Use Pad Image for Outpainting node → mask the new edges → inpaint.
Upscaling
Two stages:
- Latent upscale (cheap, blurry) — Latent Upscale by node, factor 2.0.
- Image upscale model (sharp) — Upscale Image (using Model) with 4x-UltraSharp, or RealESRGAN_x4plus_anime_6B for anime.
Or iterative upscale: small image → upscale 1.5x → low-denoise sampler pass → upscale 1.5x again. Best quality, slowest.
Regional Prompting and Conditioning Combine {#regional}
To prompt different regions of the image differently (left side: knight, right side: wizard):
Use ComfyUI_Cutoff or ComfyUI-RegionalPrompter custom nodes. Pattern:
[CLIP Text Encode "knight"] → COND_LEFT
[CLIP Text Encode "wizard"] → COND_RIGHT
[Conditioning (Set Area)] (left half) → COND_LEFT_AREA
[Conditioning (Set Area)] (right half) → COND_RIGHT_AREA
[Conditioning Combine] → COND_FINAL → KSampler
Areas are specified in pixel coordinates. Resolution must match Empty Latent Image size.
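In API format the Set Area node is ConditioningSetArea, with the region given in pixels. A sketch for the knight on the left half of a 1024×1024 canvas (node ID hypothetical; verify input names against your ComfyUI version):
set_area_left = {
    "class_type": "ConditioningSetArea",
    "inputs": {
        "conditioning": ["20", 0],  # COND_LEFT from CLIP Text Encode "knight"
        "x": 0,
        "y": 0,
        "width": 512,   # left half of the 1024-wide latent
        "height": 1024,
        "strength": 1.0,
    },
}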
Flux: Dev, Schnell, GGUF, Quantized {#flux}
Flux is a 12B-parameter Diffusion Transformer (DiT) — different architecture from SD's UNet, with better prompt adherence and text rendering.
Files needed
models/unet/flux1-dev.safetensors # 24 GB BF16
models/clip/t5xxl_fp16.safetensors # 9.8 GB
models/clip/clip_l.safetensors # 246 MB
models/vae/ae.safetensors # 335 MB
For 16 GB VRAM use flux1-dev-fp8.safetensors (12 GB) and t5xxl_fp8_e4m3fn.safetensors (5 GB).
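A sketch for fetching the files with huggingface_hub. Flux Dev is a gated repo, so accept the license on Hugging Face and authenticate first; the encoder repo below is the one ComfyUI's examples point at, but verify names before committing to a ~35 GB download:
from huggingface_hub import hf_hub_download  # pip install huggingface_hub
# Requires prior `huggingface-cli login` (or HF_TOKEN) for the gated repo.
hf_hub_download(
    repo_id="black-forest-labs/FLUX.1-dev",
    filename="flux1-dev.safetensors",
    local_dir="models/unet",
)
hf_hub_download(
    repo_id="comfyanonymous/flux_text_encoders",
    filename="t5xxl_fp16.safetensors",
    local_dir="models/clip",
)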
Workflow
[Load Diffusion Model] → MODEL (flux1-dev)
[DualCLIPLoader] → CLIP (clip_l + t5xxl)
[Load VAE] → VAE (ae.safetensors)
[CLIP Text Encode] → CONDITIONING
[Empty Latent Image] (1024×1024) → LATENT
[KSamplerAdvanced] (20 steps, cfg=1.0, euler, simple scheduler)
[VAE Decode] → IMAGE
Flux is guidance-distilled, so always set cfg=1.0 (classifier-free guidance off); SD models, by contrast, use CFG 5-10.
Flux Schnell
Same workflow, but use flux1-schnell.safetensors and 4 steps. ~5x faster, slightly lower quality.
Flux LoRA
[Load LoRA] (after Load Diffusion Model)
Most Flux LoRAs work at 0.7-1.0 strength. Civitai now has 3,000+ Flux LoRAs.
Video Generation: Wan 2.2, HunyuanVideo, Mochi {#video}
Wan 2.2 (recommended starting point)
Alibaba's open-source video model. 5-10 second 720p clips, ~24 GB VRAM at Q8 GGUF.
[UnetLoader (GGUF)] → MODEL (wan2.2-i2v-q8_0.gguf)
[DualCLIPLoader] → CLIP (umt5-xxl encoder)
[Load VAE] → VAE
[Load Image] → IMAGE (first frame for image-to-video)
[CLIP Text Encode] → CONDITIONING
[WanImageToVideo] → LATENT (sequence)
[KSampler] (30 steps)
[VAE Decode (sequence)] → IMAGES
[Video Combine (FFmpeg)] → MP4
Render time on RTX 4090: ~6-10 minutes for 5 seconds at 720p.
HunyuanVideo
Tencent's 13B video DiT. Highest quality, ~40 GB BF16 (fits in 24 GB at Q4 GGUF). Use kijai/ComfyUI-HunyuanVideoWrapper.
Mochi
Genmo's 10B model. Lower VRAM (~16 GB), faster, slightly lower quality than Hunyuan.
Frame interpolation and upscaling
After generating a 24fps video:
- RIFE for interpolation to 60fps (ComfyUI-Frame-Interpolation).
- Real-ESRGAN x4 for upscaling to 1440p / 4K.
API Mode and Programmatic Use {#api}
ComfyUI exposes POST /prompt for queuing workflows programmatically.
import json
import requests
import uuid
# Load workflow JSON saved from UI (Save (API Format))
with open("workflow_api.json") as f:
    workflow = json.load(f)
# Modify any node — e.g., set positive prompt
workflow["6"]["inputs"]["text"] = "a cyberpunk samurai at dawn"
workflow["3"]["inputs"]["seed"] = 42
# Submit
client_id = str(uuid.uuid4())
resp = requests.post("http://127.0.0.1:8188/prompt", json={
    "prompt": workflow,
    "client_id": client_id,
})
prompt_id = resp.json()["prompt_id"]
# Poll history
history = requests.get(f"http://127.0.0.1:8188/history/{prompt_id}").json()
WebSocket (/ws) streams progress events. Output images are at /view?filename=...&subfolder=....
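For live progress, subscribe to the WebSocket with the same client_id. A sketch using the websocket-client package; the message schema follows ComfyUI's bundled websockets_api_example.py and may change between versions:
import json
import websocket  # pip install websocket-client
ws = websocket.WebSocket()
ws.connect(f"ws://127.0.0.1:8188/ws?clientId={client_id}")
while True:
    msg = ws.recv()
    if isinstance(msg, bytes):
        continue  # binary frames carry in-progress preview images
    event = json.loads(msg)
    if event["type"] == "progress":
        data = event["data"]
        print(f"step {data['value']}/{data['max']}")
    elif event["type"] == "executing" and event["data"]["node"] is None:
        break  # node == None signals the prompt has finished
ws.close()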
VRAM Optimization Tricks {#vram-optimization}
| Trick | VRAM Saved | Cost |
|---|---|---|
| FP8 weights (--fast) | 50% | <1% quality |
| GGUF Q8_0 | 50% | <1% quality |
| GGUF Q4_K_S | 75% | 3-5% quality |
| Tile VAE Decode | ~30% peak | none |
| Sequential Offload (--lowvram) | 70% | ~30% slower |
| CFG Rescale (skip CFG dup) | 50% during sampling | none |
| BFloat16 over FP32 | 50% | none |
| Reduce resolution then upscale | proportional | varies |
For 12 GB VRAM running Flux Dev: --fast + Q8 T5 + Tile VAE + --lowvram if needed.
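To see what the table is up against on your card, standard PyTorch calls report free vs. total VRAM:
import torch
# Free vs. total memory on the default CUDA device, in GiB.
free, total = torch.cuda.mem_get_info()
print(f"free {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")
# Rough budgeting: FP8 weights need about half their BF16 size, Q4 about a
# quarter, plus headroom for latents, text encoders, and the VAE.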
Speed Tuning: Sage Attention, Triton, Compile {#speed-tuning}
Sage Attention
pip install sageattention
python main.py --use-sage-attention
Sage Attention is a faster attention implementation than xformers / SDPA on Ada and Blackwell. 15-30% throughput improvement on Flux and SDXL.
Triton (Linux/WSL2)
pip install triton
Triton enables more efficient kernels. Required by Sage Attention and several custom node packs. Windows native does not officially support Triton; use WSL2.
torch.compile
Some custom node packs (kijai's wrappers, ComfyUI-MultiGPU) expose torch.compile modes. Compilation takes 1-3 minutes on first run but generation is 10-25% faster afterward. Mode max-autotune is fastest but adds ~5 min compile time.
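Under the hood these wrappers call PyTorch's compiler. A minimal sketch of the same API on a toy module (not ComfyUI-specific code):
import torch
import torch.nn as nn
# torch.compile JIT-compiles the module into fused kernels on its first call.
net = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)).cuda()
net = torch.compile(net, mode="max-autotune")  # "default" compiles faster
x = torch.randn(8, 64, device="cuda")
y = net(x)  # first call triggers compilation; later calls reuse the kernels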
TeaCache / FBCache
Caches diffusion model attention/MLP outputs across consecutive timesteps. 1.5-2.0x speedup with 1-3% quality loss. Custom nodes: ComfyUI-TeaCache, ComfyUI-FBCache.
Common Custom Node Packs Worth Installing {#custom-nodes}
| Pack | Purpose |
|---|---|
| ComfyUI-Manager | Mandatory |
| ComfyUI_IPAdapter_plus | IPAdapter |
| ComfyUI-Advanced-ControlNet | Better ControlNet |
| ComfyUI-Impact-Pack | Detailers, face fix |
| rgthree-comfy | Better UI nodes (mute, group bypass) |
| ComfyUI-GGUF | GGUF-quantized model loaders |
| ComfyUI-Frame-Interpolation | RIFE / FILM video frame interp |
| ComfyUI-VideoHelperSuite | Video I/O |
| ComfyUI-TeaCache | 2x speed for diffusion |
| was-node-suite-comfyui | Misc utility nodes |
| ComfyUI-KJNodes | kijai's nodes for Wan, Hunyuan, Mochi |
| ComfyUI-MultiGPU | Run encoder on GPU 1, UNet on GPU 0 |
| ComfyUI-Custom-Scripts | UI improvements |
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| OOM on first generation | VRAM too tight | Add --lowvram or use FP8/GGUF |
| Black output | NaN in VAE | Switch to a fixed fp16 VAE, or run with --fp32-vae |
| Workflow won't load | Missing custom nodes | Manager → Install Missing Custom Nodes |
| Very slow on RTX 40 | Not using FP8 | Add --fast flag |
| Flux looks washed out | Wrong sampler | Use euler + simple, cfg=1.0 |
| ControlNet has no effect | Wrong base model | SDXL ControlNet on SDXL only, etc. |
| LoRA doesn't trigger | Missing trigger words | Check Civitai page for prompt tokens |
| Video sequences flicker | No frame consistency | Enable fp16_attention in Wan / Hunyuan nodes |
| Crashes on large images | Tile VAE not enabled | Add Tile VAE Decode node |
| Slow on AMD via Vulkan | Not using ROCm | See AMD ROCm Setup |
Sources: ComfyUI GitHub | ComfyUI Manager | Black Forest Labs Flux | Stability AI SD 3.5 | kijai's video nodes | city96 GGUF nodes | Internal benchmarks on RTX 3090, 4090, 5090, RX 7900 XTX, M4 Max.