Stable Diffusion Local Install (2026): VRAM, Setup, Models
To run Stable Diffusion locally in 2026 you need an NVIDIA GPU with at least 8 GB of VRAM for SDXL (12 GB is the comfortable floor); SD 1.5 runs on as little as 4 GB, while SD 3.5 Large (8.1B params) wants ~11 GB even with NVIDIA's FP8 build. Pick a checkpoint family (SDXL is still the best all-round choice in 2026 thanks to its enormous LoRA and ControlNet ecosystem), grab the weights from Hugging Face or CivitAI, and run them through a front end like Forge, Fooocus, or ComfyUI. This page is the install-and-hardware hub; the step-by-step UI walkthroughs live in their own guides linked below.
If you only remember one thing: VRAM, not raw GPU speed, decides what you can run. An 8 GB card opens SDXL; 12 GB makes SDXL comfortable and brings SD 3.5 into reach; 16 GB+ removes nearly every limit on the consumer side.
Can my GPU run Stable Diffusion? (Verdict)
Here is the blunt answer by VRAM tier. These are practical, generation-time figures for 1024x1024 (SDXL/SD 3.5) or 512x512 (SD 1.5), not training.
| Your VRAM | What you can run | Verdict |
|---|---|---|
| 4 GB | SD 1.5 only (512x512), slow | ⚠️ Minimum — works, but tight |
| 6 GB | SD 1.5 comfortably; SDXL with offload tricks | ⚠️ SD 1.5 tier |
| 8 GB | SDXL at 1024x1024; SD 3.5 Medium is tight (Stability quotes ~9.9 GB excl. text encoders) | ✅ The real entry point |
| 12 GB | SDXL comfortably + SD 3.5 Large (FP8) | ✅ Sweet spot |
| 16 GB | Everything above, faster, bigger batches | ✅ Comfortable |
| 24 GB+ | SDXL/SD 3.5 with headroom, light training | ✅ No limits (consumer) |
Short version: 8 GB is the honest entry point for modern image work, 12 GB is the sweet spot, and 4-6 GB confines you to SD 1.5. AMD and Apple Silicon can run Stable Diffusion too, but with caveats covered further down. Not sure how your exact card maps? Run it through our "Can I run local AI?" checker before you download gigabytes of weights.
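If you would rather read the number off your own machine first, the following is a minimal sketch for NVIDIA cards only. It assumes the `nvidia-smi` tool that ships with the NVIDIA driver is on your PATH; the helper names `parse_vram_mib` and `detect_vram_gib` are made up for this example.

```python
import shutil
import subprocess


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi CSV output.

    Lines that are not plain integers (for example "[N/A]") are skipped.
    """
    values = []
    for line in smi_output.splitlines():
        text = line.strip()
        if text.isdigit():
            values.append(int(text))
    return values


def detect_vram_gib() -> list[float]:
    """Return total VRAM per NVIDIA GPU in GiB, or [] if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools found on PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [mib / 1024 for mib in parse_vram_mib(result.stdout)]


if __name__ == "__main__":
    print(detect_vram_gib())
```

Compare the result against the table above. AMD and Apple Silicon machines need a different tool, so an empty list here only means no NVIDIA GPU was detected.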
SD 1.5 vs SDXL vs SD 3.5: which checkpoint family?
Stability AI has shipped three major open generations that people still run locally. They are not strictly "newer is better" — each trades VRAM for quality and ecosystem depth, and the parameter counts below explain why bigger models need more memory.
| Model | Parameters | Released | Native resolution | Practical VRAM (local) |
|---|---|---|---|---|
| SD 1.5 | ~0.98B | Oct 2022 | 512x512 | ~4 GB min |
| SDXL 1.0 | ~3.5B | Jul 2023 | 1024x1024 | 8 GB min / 12 GB comfortable |
| SD 3.5 Medium | 2.5B | Oct 29, 2024 | up to ~1440px | ~9.9 GB (excl. text encoders) |
| SD 3.5 Large | 8.1B | Oct 22, 2024 | 1 MP (1024x1024) | ~18 GB FP16; ~11 GB FP8 |
A few honest notes on these numbers:
- SD 1.5 is ancient by AI standards (2022) but refuses to die. Its 512x512 base and small ~1B-parameter weights mean it runs on almost anything, and its years-deep library of fine-tunes and LoRAs keeps it relevant for stylized work on weak GPUs.
- SDXL 1.0 more than tripled the parameters to ~3.5B and moved to native 1024x1024. This is the model most local creators still build around in 2026.
- SD 3.5 Large is the highest-quality open Stability model at 8.1B parameters, but at FP16 it needs around 18 GB of VRAM. NVIDIA and Stability shipped an FP8 build that cuts that requirement by roughly 40% to about 11 GB, which is what makes it viable on a 12 GB card.
- SD 3.5 Medium (2.5B) is the consumer-targeted variant — Stability quotes ~9.9 GB of VRAM excluding text encoders, so plan for a bit more in practice.
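The link between parameter count and memory can be sketched with simple arithmetic: weights take roughly parameters times bytes per parameter. The snippet below is an illustrative lower bound only. It assumes every weight is stored at the named precision and ignores activations, the VAE, and anything the quoted parameter count does not cover, which is why the practical figures above are higher. The function name `weight_gb` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

# Parameter counts (in billions) as quoted in the table above.
MODELS = {
    "SD 1.5": 0.98,
    "SDXL 1.0": 3.5,
    "SD 3.5 Medium": 2.5,
    "SD 3.5 Large": 8.1,
}


def weight_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for name, params in MODELS.items():
        fp16 = weight_gb(params, "fp16")
        fp8 = weight_gb(params, "fp8")
        print(f"{name}: ~{fp16:.1f} GB at FP16, ~{fp8:.1f} GB at FP8 (weights only)")
```

For SD 3.5 Large this gives about 16.2 GB at FP16 and 8.1 GB at FP8. The gap to the ~18 GB and ~11 GB figures above is the working memory needed on top of the weights.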
All four are available for download on Hugging Face; SDXL and SD 3.5 are hosted under Stability AI's organization. SD 3.5 ships under the Stability AI Community License (free for research and for commercial use under $1M annual revenue).
Why does SDXL still win in 2026?
Newer does not mean dominant. Even with FLUX and SD 3.5 available, SDXL remains the default for most serious local workflows, and the reason is ecosystem, not base-model quality:
- The LoRA library is enormous. CivitAI hosts thousands of SDXL style, character, and concept LoRAs — the deepest catalog of any open image model. If you want a specific art style, character, or look, the odds it already exists as an SDXL LoRA are high.
- The ControlNet and inpainting stack is the most mature. SDXL has well-supported ControlNet models (pose, depth, edges, etc.), plus battle-tested inpainting, outpainting, and img2img workflows that "just work" across Forge, ComfyUI, and A1111.
- It fits 8 GB. SDXL runs at native 1024x1024 on an 8 GB card, where SD 3.5 Large and FLUX push you toward 12 GB+ or aggressive quantization.
FLUX produces cleaner prompt-following and better photorealism out of the box, and for some users it is the better default in 2026 — but its LoRA/ControlNet tooling is still catching up to SDXL's. If you depend on specific community fine-tunes, SDXL's ecosystem is currently deeper. For the FLUX side of that trade-off, including its VRAM tiers, see our guide on running FLUX.1 locally.
Where do I get checkpoints and models?
Two sources cover virtually everything:
- Hugging Face: the official, canonical source for base weights (SD 1.5, SDXL, SD 3.5). Use this for the original Stability models and reputable community releases. Files are usually .safetensors.
- CivitAI: the community hub for fine-tuned checkpoints, LoRAs, ControlNet models, embeddings, and VAEs. This is where the thousands of SDXL community models live. Filter by base model (SD 1.5 / SDXL / SD 3.5) so you download a LoRA that matches your checkpoint; an SDXL LoRA will not load against an SD 1.5 checkpoint.
Always prefer .safetensors over .ckpt (the older pickle format can execute arbitrary code on load). Drop checkpoints into your UI's models folder (for example, models/Stable-diffusion in A1111/Forge, or models/checkpoints in ComfyUI), and put LoRAs in the matching Lora / loras folder.
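The folder rules above can be sketched as a small lookup. This is an illustration, not part of any UI: the `destination` helper and the `MODEL_DIRS` table are hypothetical, the folder names are the ones quoted in the text, and rejecting anything other than .safetensors outright is a stricter choice than the "prefer" advice above.

```python
from pathlib import Path

# Folder names as quoted in the text; check your UI's docs if it differs.
MODEL_DIRS = {
    ("a1111", "checkpoint"): "models/Stable-diffusion",
    ("forge", "checkpoint"): "models/Stable-diffusion",
    ("comfyui", "checkpoint"): "models/checkpoints",
    ("a1111", "lora"): "models/Lora",
    ("forge", "lora"): "models/Lora",
    ("comfyui", "lora"): "models/loras",
}


def destination(ui_root: Path, ui: str, kind: str, filename: str) -> Path:
    """Return where a downloaded model file should be placed.

    Raises ValueError for non-.safetensors files (a deliberate, strict choice)
    and KeyError for a UI/kind pair that is not in MODEL_DIRS.
    """
    name = Path(filename).name
    if Path(name).suffix.lower() != ".safetensors":
        raise ValueError(f"{name}: expected a .safetensors file")
    return ui_root / MODEL_DIRS[(ui, kind)] / name


if __name__ == "__main__":
    print(destination(Path("ComfyUI"), "comfyui", "lora", "style.safetensors"))
```

The function only computes a path. Moving the file is left to you, so nothing is overwritten by accident.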
How do I actually install it? (Setup paths)
There is no single "Stable Diffusion installer" — you install a front end (UI) that loads the checkpoints. Pick one based on how much control you want:
| Front end | Best for | Difficulty |
|---|---|---|
| Fooocus | Beginners — Midjourney-style, near-zero config | Easiest |
| Forge | A1111 users who want speed + native FLUX | Easy-Medium |
| AUTOMATIC1111 | The classic full-featured WebUI, max extensions | Medium |
| ComfyUI | Power users — node graphs, full pipeline control | Advanced |
Rather than duplicate hundreds of lines of click-by-click steps here, follow the dedicated walkthrough for whichever UI you choose — each is kept current with 2026 versions:
- Fooocus guide — the easiest path; great if you just want good SDXL images with almost no setup.
- SD Forge guide — a faster A1111 fork with native FLUX support; the pragmatic middle ground.
- AUTOMATIC1111 guide — the original, extension-rich WebUI most tutorials assume.
- ComfyUI guide — node-based, the most flexible front end and the reference UI for SD 3.5 and FLUX.
The common flow is the same regardless of UI: install Python + Git, clone or download the UI, drop a checkpoint into its models folder, and launch. On NVIDIA this is the smoothest; AMD and Apple need the extra steps below.
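Before cloning anything, a quick preflight check can confirm the prerequisites from that flow are present. The sketch below is a hypothetical helper, not something any of the UIs ship. It only checks that named tools are on your PATH and prints the running Python version; it does not check which Python version a given UI expects, so consult that UI's guide for the requirement.

```python
import shutil
import sys
from typing import Callable, Iterable, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools from `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    missing = missing_tools(["git"])
    if missing:
        print("Install before continuing:", ", ".join(missing))
    else:
        print("Git found; ready to clone a front end.")
```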
Best Stable Diffusion models per VRAM tier
If you are choosing a checkpoint to match your card, this is the practical mapping. "Comfortable" means 1024x1024 (or 512x512 for SD 1.5) without offload hacks.
| Your GPU VRAM | Recommended checkpoint family | Why |
|---|---|---|
| 4-6 GB | SD 1.5 (and SD 1.5 fine-tunes) | Only family that fits comfortably; huge LoRA library |
| 8 GB | SDXL 1.0 (+ SDXL fine-tunes) | Native 1024x1024 fits; deepest ControlNet/LoRA stack |
| 12 GB | SDXL + SD 3.5 Large (FP8) | SDXL with headroom; SD 3.5 Large becomes viable at FP8 (~11 GB) |
| 16 GB+ | Any: SDXL, SD 3.5 Large, FLUX | Room for big batches, high res, and quantized FLUX |
For an 8 GB card the answer is almost always SDXL: it is the highest-quality family that fits at full resolution and has the richest tooling. For 12 GB, SDXL stays the workhorse but SD 3.5 Large in FP8 is worth keeping around for its prompt fidelity. To size a specific quantized build against your exact card, our VRAM requirements guide breaks down the math.
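The mapping above can be expressed as a threshold lookup. This is a sketch with one assumption worth flagging: the table has no rows for in-between sizes such as 10 GB, so the hypothetical `recommend` function rounds down to the nearest listed tier.

```python
from bisect import bisect_right

# (minimum VRAM in GB, recommended checkpoint family), taken from the table above.
TIERS = [
    (4, "SD 1.5 (and SD 1.5 fine-tunes)"),
    (8, "SDXL 1.0 (and SDXL fine-tunes)"),
    (12, "SDXL plus SD 3.5 Large (FP8)"),
    (16, "Any: SDXL, SD 3.5 Large, FLUX"),
]


def recommend(vram_gb: float) -> str:
    """Return the recommended family for a card with `vram_gb` of VRAM.

    Sizes between tiers fall back to the tier below (an assumption).
    """
    thresholds = [minimum for minimum, _ in TIERS]
    index = bisect_right(thresholds, vram_gb) - 1
    if index < 0:
        return "Below the 4 GB minimum in the table"
    return TIERS[index][1]


if __name__ == "__main__":
    for size in (4, 6, 8, 12, 24):
        print(f"{size} GB -> {recommend(size)}")
```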
NVIDIA vs AMD vs Apple Silicon
Stable Diffusion runs on all three, but the experience is not equal:
- NVIDIA (recommended): CUDA is the native target. Everything — Forge, ComfyUI, A1111, every extension — is built and tested for it first. An RTX-class card with 8-12 GB is the path of least resistance and the fastest per dollar.
- AMD: Works, with caveats. On Linux with ROCm properly configured, AMD cards reach near-CUDA performance. On Windows you are on DirectML or ZLUDA (community projects such as the AMD-GPU A1111/Forge forks), where performance typically runs roughly 30-50% behind an equivalent NVIDIA card. Doable, but expect setup friction.
- Apple Silicon (M-series): Runs via the Metal Performance Shaders (MPS) backend. SDXL works on a Mac with 16 GB or more of unified memory, and native apps like Draw Things use Apple's Metal engine for a smoother experience; community reports put an SDXL 1024x1024 image at roughly 25-40 seconds on an M2 Pro. Usable and convenient, but a Mac generates noticeably more slowly than a comparable CUDA GPU, so it is a good "I already own one" option rather than a machine to buy specifically for image generation.
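To see which of these backends your own Python environment exposes, here is a minimal sketch. It assumes the front end runs on PyTorch, which this article does not state explicitly, and the `pick_device` helper is hypothetical. ROCm builds of PyTorch report through the same `torch.cuda` interface, so an AMD card on Linux can show up as "cuda". DirectML is not covered here.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see.

    Falls back to "cpu" when PyTorch is not installed at all.
    """
    try:
        import torch
    except ImportError:
        return "cpu"

    # True for NVIDIA CUDA builds and also for AMD ROCm builds of PyTorch.
    if torch.cuda.is_available():
        return "cuda"

    # Apple Silicon: Metal Performance Shaders backend.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


if __name__ == "__main__":
    print("Selected device:", pick_device())
```

A result of "cpu" on a machine with a supported GPU usually points to a PyTorch build that does not match the hardware, not to a missing GPU.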
First-hand notes from a local box
Take these as approximate, single-machine observations, not a controlled benchmark. On an RTX 3060 12 GB with Forge, SDXL at 1024x1024 / ~25 steps lands roughly in the 12-20 second per image range once the model is loaded, and SD 1.5 at 512x512 is near-instant by comparison. SD 3.5 Large in its FP8 build fit on the same 12 GB card but ran appreciably slower and left far less headroom for big batches. The pattern matches the rule above: VRAM determines what loads at all, and once a model fully fits in VRAM, generation is fast — the moment it spills to system RAM, speed collapses. If your generations suddenly crawl, you have almost certainly run out of VRAM and the UI is offloading.
Key Takeaways
- 8 GB of NVIDIA VRAM is the honest entry point for modern Stable Diffusion (SDXL). SD 1.5 runs on 4 GB; 12 GB is the comfortable sweet spot and unlocks SD 3.5 Large in FP8.
- SDXL (3.5B, 2023) still wins for most local workflows in 2026 — not because the base is best, but because its LoRA, ControlNet, and inpainting ecosystem is the deepest of any open model.
- SD 3.5 Large (8.1B) needs ~18 GB at FP16, ~11 GB with the FP8 build; SD 3.5 Medium (2.5B) targets consumer GPUs at ~9.9 GB excluding text encoders.
- Get base weights from Hugging Face, community fine-tunes/LoRAs from CivitAI, and always prefer .safetensors, matching the LoRA's base model to your checkpoint.
- You install a front end, not "Stable Diffusion" itself: Fooocus (easiest), Forge (fast), A1111 (classic), or ComfyUI (most flexible). NVIDIA is smoothest; AMD and Apple work with extra setup and slower speeds.
Next Steps
- Brand new and want images fast? Start with the Fooocus guide — the lowest-effort path to good SDXL output.
- Want speed plus FLUX support in a familiar A1111 layout? Follow the SD Forge guide.
- Prefer the classic, extension-heavy WebUI? Use the AUTOMATIC1111 guide.
- Ready for full pipeline control or SD 3.5 / FLUX? See the ComfyUI guide.
- Considering FLUX as your default instead of SDXL? Read Run FLUX.1 locally for its VRAM tiers and trade-offs.
- Not sure your GPU is enough? Check the VRAM requirements guide or the can-I-run checker before downloading.