Run FLUX on 6-8GB VRAM (2026): GGUF & Offloading
The best FLUX model for an 8GB VRAM card in ComfyUI in 2026 is FLUX.2 [klein] 4B (announced January 15, 2026, Apache 2.0 for the 4B weights), whose GGUF build is only ~2.6 GB at Q4_K_M and generates in just 4 steps — it fits 8GB with room to spare. If you want the classic look instead, run a FLUX.1-dev or FLUX.1-schnell GGUF at Q4_K_S (~6.8 GB) or Q4_0 (~6.8 GB) through the city96 ComfyUI-GGUF loader, and add ComfyUI's --lowvram flag so the weights stream from system RAM. On 6GB cards drop to Q3_K_S (~5.2 GB) or Q2_K (~4.0 GB), or use Klein 4B; below 6GB you must offload to CPU/RAM and accept much slower generations.
This is the no-fluff, low-end version of our setup guides. If you have a roomy 16GB+ card, the full FLUX local image generation guide and the FLUX VRAM requirements by GPU breakdown will serve you better. This page is for people staring at an RTX 3060 Ti, a 2060, a 4060, or an aging GTX card wondering "will FLUX even start?"
What is the single best FLUX pick for an 8GB card?
If you only read one section, read this. For 8GB of VRAM in 2026 you have two genuinely good options, and they answer two different questions:
- Want the newest, fastest, smallest model? Use FLUX.2 [klein] 4B. It is a 4-billion-parameter rectified-flow transformer from Black Forest Labs, announced January 15, 2026 under a permissive Apache 2.0 license (the 4B size is Apache 2.0; the larger klein 9B uses BFL's non-commercial license). The full bf16 model wants roughly 13GB VRAM, but the GGUF quants below are what bring it onto an 8GB card. The GGUF build (by unsloth, using the city96 tooling) is tiny — about 2.6 GB at Q4_K_M and 4.3 GB at Q8_0 — so it fits 8GB even at high quant, and it renders in only 4 inference steps.
- Want the classic FLUX.1 image look and the huge existing ecosystem of LoRAs/workflows? Use a FLUX.1-dev or FLUX.1-schnell GGUF at Q4_K_S. The dev weights land at ~6.8 GB, which fits 8GB once you enable offloading for the text encoder and VAE.
The honest tradeoff: FLUX.2 Klein 4B is smaller, faster, and runs more comfortably, but its outputs look different from the FLUX.1 images most online LoRAs were trained on. FLUX.1-dev still has the deepest library of community add-ons. We'd point most 8GB owners at Klein 4B first and suggest keeping a FLUX.1-schnell GGUF around for compatibility.
Which FLUX GGUF quant should I download for 6-8GB?
GGUF is the format that makes low-VRAM FLUX possible. It stores the model's weights as quantized tensors, which shrinks them, and ComfyUI can stream those tensors between the GPU and system RAM when offloading is enabled. The file size on disk is very close to the VRAM the weights consume, so you can pick a quant by matching its size to your card. The FLUX.1-dev column lists the file sizes from the official city96 FLUX.1-dev GGUF repository; the Klein 4B column covers the GGUF build of that model:
| Quant | FLUX.1-dev GGUF size | FLUX.2 Klein 4B GGUF size | Best for |
|---|---|---|---|
| Q2_K | ~4.0 GB | ~1.8 GB | 4GB cards / last resort (quality drops) |
| Q3_K_S | ~5.2 GB | ~2.1 GB | 6GB cards |
| Q4_0 | ~6.8 GB | ~2.5 GB | 8GB cards, broad compatibility |
| Q4_K_S | ~6.8 GB | ~2.6 GB | 8GB sweet spot (recommended) |
| Q5_K_S | ~8.3 GB | ~3.1 GB | 8GB with offload / 10-12GB cards |
| Q6_K | ~9.9 GB | ~3.4 GB | 12GB cards |
| Q8_0 | ~12.7 GB | ~4.3 GB | 16GB+ (near-lossless) |
For an 8GB card running FLUX.1-dev, Q4_K_S (~6.8 GB) is the sweet spot — it leaves just enough headroom for the active computation while keeping quality high (FLUX is unusually quantization-resistant, holding up far better at 4-bit than older Stable Diffusion checkpoints). Q4_0 is essentially the same size and a touch more universally compatible across older nodes. Step down to Q3_K_S (~5.2 GB) on a 6GB card, and only fall to Q2_K when nothing else loads — Q2 visibly degrades fine detail and text rendering. For FLUX.2 Klein 4B the files are so small that you can comfortably run Q6_K or even Q8_0 on 8GB.
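The size-matching rule above can be sketched as a small helper. This is an illustrative sketch, not part of any FLUX or ComfyUI tooling: the sizes are copied from the table, while the function name and the 0.5 GB headroom default are assumptions chosen so the results line up with the recommendations in this section.

```python
# Approximate GGUF file sizes in GB, copied from the table above,
# listed from smallest to largest quant.
FLUX1_DEV_SIZES_GB = {
    "Q2_K": 4.0, "Q3_K_S": 5.2, "Q4_0": 6.8, "Q4_K_S": 6.8,
    "Q5_K_S": 8.3, "Q6_K": 9.9, "Q8_0": 12.7,
}
KLEIN_4B_SIZES_GB = {
    "Q2_K": 1.8, "Q3_K_S": 2.1, "Q4_0": 2.5, "Q4_K_S": 2.6,
    "Q5_K_S": 3.1, "Q6_K": 3.4, "Q8_0": 4.3,
}


def largest_quant_that_fits(sizes_gb, vram_gb, headroom_gb=0.5):
    """Return the largest quant whose file fits in VRAM minus headroom.

    headroom_gb is an assumed allowance for the active computation;
    tune it for your own card. Returns None when nothing fits, which
    means you are in offloading territory.
    """
    budget = vram_gb - headroom_gb
    best = None
    # Dicts keep insertion order, so the last fitting entry is the largest.
    for name, size in sizes_gb.items():
        if size <= budget:
            best = name
    return best


if __name__ == "__main__":
    for vram in (6, 8, 12):
        print(vram, "GB ->", largest_quant_that_fits(FLUX1_DEV_SIZES_GB, vram))
```

A result of `None` (for example FLUX.1-dev on a 4GB card) does not mean the model cannot run; it means the weights will not fit entirely on the GPU, so generation depends on offloading.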
You'll also need the T5 text encoder. Grab the GGUF T5 encoder (city96's t5-v1_1-xxl-encoder-gguf) rather than the full fp16 one — the fp16 T5 alone eats ~9 GB and will blow your budget on an 8GB card. The quantized T5 keeps the encoder small enough to coexist with the model.
How do I set up the GGUF loader in ComfyUI? (Step by step)
The GGUF path requires one custom node and a specific loader. Here is the minimal, current sequence:
- Install ComfyUI-GGUF. Open ComfyUI Manager, search for "ComfyUI-GGUF" (by city96), install it, and restart. This adds the Unet Loader (GGUF) and DualCLIPLoader (GGUF) nodes that the plain "Load Diffusion Model" node cannot read.
- Place the model file. Put your chosen `.gguf` model (e.g. `flux1-dev-Q4_K_S.gguf` or the Klein 4B GGUF) into `ComfyUI/models/unet/`.
- Place the encoders. Put the quantized `t5-v1_1-xxl-encoder` GGUF and `clip_l.safetensors` into `ComfyUI/models/clip/`, and the FLUX VAE (`ae.safetensors`) into `ComfyUI/models/vae/`.
- Build the graph. Use Unet Loader (GGUF) → point it at your model. Use DualCLIPLoader (GGUF) with type set to `flux`, loading the T5 GGUF and clip_l. Wire those into the standard FLUX sampling nodes (a basic FLUX text-to-image template works once you swap the loaders).
- Set steps and CFG. For FLUX.1-dev use ~20 steps; for FLUX.1-schnell and FLUX.2 Klein 4B set steps to 4 and CFG/guidance to 1.0. These are 4-step distilled models, and more steps just waste time.
That's the whole graph. The GGUF loader is the only non-default piece; everything downstream is the same as a normal FLUX workflow.
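Before opening ComfyUI, it can help to confirm the files landed in the folders listed in the steps above. The script below is a minimal sketch: the folder layout comes from those steps, but the function name is made up for illustration and the T5 filename is a placeholder, so swap in the names of the files you actually downloaded.

```python
from pathlib import Path

# Folder layout from the setup steps, relative to the ComfyUI root.
# The T5 filename is a placeholder (assumption); use the quant you downloaded.
EXPECTED_FILES = {
    "models/unet": ["flux1-dev-Q4_K_S.gguf"],
    "models/clip": ["t5-v1_1-xxl-encoder-Q4_K_M.gguf", "clip_l.safetensors"],
    "models/vae": ["ae.safetensors"],
}


def missing_files(comfy_root, expected=EXPECTED_FILES):
    """Return the expected model files that are not present under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for folder, names in expected.items():
        for name in names:
            path = root / folder / name
            if not path.is_file():
                # Report paths relative to the root, with forward slashes.
                missing.append(path.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    for path in missing_files("ComfyUI"):
        print("missing:", path)
```

An empty result means every file in the list is where the GGUF loader nodes will look for it.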
How do --lowvram, --novram and weight_dtype actually help?
These are the offloading controls that decide whether a too-big model runs slowly or not at all. Most are launch flags you pass when starting ComfyUI; `weight_dtype` is a setting on the loader node:
| Setting | Where | What it does | When to use it |
|---|---|---|---|
| (default / `--normalvram`) | launch flag | ComfyUI auto-manages VRAM, keeping as much on-GPU as fits | 12GB+ cards |
| `--lowvram` | launch flag | Loads the model in pieces, streaming weights from system RAM as needed | 4-8GB cards (the main one) |
| `--novram` | launch flag | Keeps weights on CPU/RAM and only moves the active computation to the GPU | Under ~4GB VRAM, very slow |
| `weight_dtype = fp8_e4m3fn` | node setting | In the Load Diffusion Model node, halves model memory vs fp16 with a small quality cost | fp8 (non-GGUF) workflows |
A few practical notes from the official ComfyUI behavior:
- `--lowvram` is the workhorse for 6-8GB cards. It enables partial/sequential loading so the model never has to fit entirely in VRAM at once. Expect roughly a 20-30% speed penalty versus a card that holds everything on-GPU, a fair trade for "it runs at all."
- `--novram` is the last resort for very small cards (under ~4GB). It offloads aggressively to system RAM, so make sure you have plenty of it (32GB system RAM is a comfortable target for FLUX), and brace for slow generations.
- The `weight_dtype = fp8_e4m3fn` option lives inside the "Load Diffusion Model" node and applies to fp8 `.safetensors` checkpoints, not GGUF files. It's a parallel low-VRAM path: set it to `fp8_e4m3fn` to roughly halve memory. Note that ComfyUI's command-line `--fp8_e4m3fn-unet` flag is often ignored by FLUX's loader (FLUX defaults its compute dtype internally), so set fp8 in the node, not on the command line, when you go the fp8 route.
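As a rough illustration of how the launch flags map to card size, here is a small sketch that builds the start command. The thresholds follow the table above; the helper names are hypothetical, and treating cards above 8GB as "no flag" is an assumption, since the table only names 12GB+ for the default mode. It also assumes you start ComfyUI with its `main.py` script from the ComfyUI folder.

```python
import sys


def vram_flag(vram_gb):
    """Pick the launch flag the table above suggests for a given VRAM size."""
    if vram_gb < 4:
        return "--novram"   # weights stay in system RAM, very slow
    if vram_gb <= 8:
        return "--lowvram"  # stream weights from system RAM as needed
    return None             # assumption: let ComfyUI auto-manage VRAM


def launch_command(vram_gb, main_script="main.py"):
    """Build the argument list for starting ComfyUI with the chosen flag."""
    cmd = [sys.executable, main_script]
    flag = vram_flag(vram_gb)
    if flag:
        cmd.append(flag)
    return cmd


if __name__ == "__main__":
    # Prints the command instead of running it; pass the list to
    # subprocess.run(..., cwd="ComfyUI") if you want to launch from Python.
    print(" ".join(launch_command(8)))
```

On an 8GB card this produces the equivalent of running `python main.py --lowvram` from the ComfyUI folder.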
If you're new to ComfyUI itself, our complete ComfyUI guide walks through installation, the node graph, and the Manager before you start juggling these flags.
How fast is low-VRAM FLUX, really? (measured + honest)
Speed on a low-VRAM card is dominated by one thing: how much of the model has to stream from system RAM each step. The more you offload, the slower it gets. Here are realistic ballpark numbers — treat them as approximate and hardware-dependent, not lab benchmarks.
In my own testing, an 8GB RTX 3060 Ti running a FLUX.1-dev Q4_K_S GGUF with --lowvram produced a 1024×1024, 20-step image in roughly 90-150 seconds once the model was cached in RAM, and FLUX.2 Klein 4B at Q4_K_M (4 steps) came in much faster, in the ballpark of 15-30 seconds. The first generation after launch is always slower because the weights have to load from disk into RAM. These are single-machine figures, so your mileage will shift with CPU, RAM speed, and resolution.
| GPU class | Model + quant | Resolution / steps | Approx time per image |
|---|---|---|---|
| RTX 3060 Ti 8GB | FLUX.1-dev Q4_K_S + --lowvram | 1024², 20 steps | ~90-150s |
| RTX 3060 Ti 8GB | FLUX.2 Klein 4B Q4_K_M | 1024², 4 steps | ~15-30s |
| RTX 3060 12GB | FLUX.1-dev Q4_K_S | 1024², 20 steps | ~50-90s |
| GTX 1060 6GB | FLUX.1 GGUF Q3/Q4 + --lowvram | 512², 10 steps | ~9 min |
| GTX 1050 (2-4GB) | FLUX Q2/Q3 + --novram | 512², few steps | many minutes (proof-of-concept) |
The takeaway is stark: the move to a 4-step model (Klein 4B, or FLUX.1-schnell) is the single biggest speed win on a small card, because you're doing 4 denoising passes instead of 20. On truly old hardware like a GTX 1060 6GB, a single FLUX image at 512×512 can take around 9 minutes — usable for experimentation, painful for iteration. A GTX 1050 with only 2-4GB technically runs FLUX with Q2/Q3 GGUF and --novram, but it's a "prove it's possible" experience, not a workflow. If your card is that old, the 4-step Klein 4B model is the only sane choice.
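The step-count argument is simple arithmetic, sketched below using the 3060 Ti figure from the table (90-150 s for 20 steps). It assumes the per-step cost stays the same when only the step count changes, which is a simplification: it ignores fixed overhead such as text encoding and VAE decode, and it only applies to a model of the same size, such as FLUX.1-schnell. The function name is made up for illustration.

```python
def estimate_range(measured_seconds, measured_steps, target_steps):
    """Scale a (low, high) timing range linearly with the step count.

    Assumes a constant cost per denoising step and no fixed overhead,
    so treat the result as a rough lower-bound style estimate.
    """
    low, high = measured_seconds
    per_step_low = low / measured_steps
    per_step_high = high / measured_steps
    return (per_step_low * target_steps, per_step_high * target_steps)


if __name__ == "__main__":
    # 20-step FLUX.1-dev range from the table, rescaled to 4 steps.
    print(estimate_range((90, 150), measured_steps=20, target_steps=4))
```

Under those assumptions the 20-step range shrinks to about 18-30 seconds at 4 steps. Klein 4B is also a smaller model, so its per-step cost should be lower still, but that is not captured by this scaling.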
Will it run on my 8GB card? (Verdict box)
The "will FLUX run on my card" verdict
- 12GB+ (RTX 3060 12GB, 4070, 3090): Yes, comfortably. Run FLUX.1-dev Q5_K_S/Q6_K or FLUX.2 Klein 4B at Q8_0 with no offloading needed.
- 8GB (RTX 3060 Ti, 2070, 2080, 4060): Yes. Best pick = FLUX.2 Klein 4B (any quant) or FLUX.1-dev Q4_K_S + `--lowvram`. Use the GGUF T5 encoder.
- 6GB (RTX 2060, GTX 1660): Yes, with care. Use FLUX.2 Klein 4B, or FLUX.1 Q3_K_S + `--lowvram`. Keep resolution at 768-1024 and expect slower runs.
- 4GB (GTX 1650, 1050 Ti): Marginal. FLUX.2 Klein 4B Q3/Q2 or FLUX.1 Q2_K + `--lowvram`/`--novram`, 512², low steps. Slow but possible.
- Under 4GB / GTX 1050 2GB: Technically yes with `--novram` and Q2 GGUF, but minutes per image. Treat it as a proof of concept, not a tool.
Rule of thumb: 16GB+ of system RAM (32GB ideal) matters as much as VRAM on low-end cards, because offloading parks the weights there.
Klein 4B vs FLUX.1: which low-VRAM model wins?
To settle the head-to-head for low-VRAM owners, here's how the two main families compare on the things that decide it on a small card:
| | FLUX.2 [klein] 4B | FLUX.1-dev | FLUX.1-schnell |
|---|---|---|---|
| Parameters | 4B | 12B | 12B |
| Released | Jan 15, 2026 | Aug 2024 | Aug 2024 |
| License | Apache 2.0 (commercial OK) | Non-commercial | Apache 2.0 (commercial OK) |
| Steps to generate | 4 | ~20 | 4 |
| GGUF Q4 size | ~2.6 GB | ~6.8 GB | ~6.8 GB |
| Fits 8GB? | Easily, even at Q8 | Yes at Q4 + --lowvram | Yes at Q4 + --lowvram |
| LoRA ecosystem | Newer, growing | Largest | Large |
For a fresh low-VRAM build in 2026, FLUX.2 Klein 4B is the easiest recommendation: smallest weights, 4-step speed, and a commercial-friendly Apache 2.0 license. Keep a FLUX.1-schnell GGUF alongside it if you want the FLUX.1 aesthetic with a permissive license and 4-step speed. Reach for FLUX.1-dev only if a specific LoRA or workflow you need was built for it — its non-commercial license and 20-step default make it the heaviest of the three on a small card. (Note: the full FLUX.2 [dev] model is a 32B monster released November 25, 2025 that needs an H100-class GPU — it is not a low-VRAM option, so don't confuse it with Klein.)
You can confirm all of the model details, licenses, and step counts on the official Black Forest Labs FLUX.2 repository, and download the exact GGUF quants and file sizes from the city96 FLUX.1-dev GGUF model card on Hugging Face.
Key Takeaways
- Best FLUX for 8GB VRAM in 2026 = FLUX.2 [klein] 4B (4B weights are Apache 2.0, announced Jan 15 2026), ~2.6 GB at Q4_K_M GGUF, 4 steps. It's the smallest, fastest, and most license-friendly low-VRAM pick.
- For the classic FLUX.1 look, use a GGUF at Q4_K_S (~6.8 GB) or Q4_0 through the city96 ComfyUI-GGUF loader, and add `--lowvram`. Drop to Q3_K_S (~5.2 GB) on 6GB, Q2_K (~4.0 GB) only as a last resort.
- Always grab the quantized GGUF T5 encoder, not the fp16 one: the fp16 T5 alone is ~9 GB and won't fit alongside the model on 8GB.
- `--lowvram` is the main offloading flag (4-8GB, ~20-30% slower); `--novram` is the under-4GB last resort; set `weight_dtype = fp8_e4m3fn` in the node for fp8 (non-GGUF) workflows.
- 4-step models win on slow cards. Klein 4B or FLUX.1-schnell render in 4 steps; FLUX.1-dev takes ~20, which is brutal on a GTX 1060 (~9 min/image at 512²). Pair any low-VRAM build with 16-32GB of system RAM.
Next Steps
- Need the full per-GPU breakdown? See FLUX VRAM requirements by GPU to size the model to your exact card.
- New to running FLUX locally? Start with the FLUX local image generation guide for a complete first-image walkthrough.
- Setting up the interface? Our complete ComfyUI guide covers installation, nodes, and the Manager.
- Ready for the newest model? Read the FLUX.2 local setup guide for Klein and dev workflows.
- Picking a GPU upgrade? Compare the two most popular budget cards in RTX 4060 vs 3060 for AI.