Run FLUX on 6-8GB VRAM (2026): GGUF & Offloading

June 20, 2026
12 min read
Local AI Master Research Team

The best FLUX model for an 8GB VRAM card in ComfyUI in 2026 is FLUX.2 [klein] 4B (announced January 15, 2026, Apache 2.0 for the 4B weights), whose GGUF build is only ~2.6 GB at Q4_K_M and generates in just 4 steps — it fits 8GB with room to spare. If you want the classic look instead, run a FLUX.1-dev or FLUX.1-schnell GGUF at Q4_K_S (~6.8 GB) or Q4_0 (~6.8 GB) through the city96 ComfyUI-GGUF loader, and add ComfyUI's --lowvram flag so the weights stream from system RAM. On 6GB cards drop to Q3_K_S (~5.2 GB) or Q2_K (~4.0 GB), or use Klein 4B; below 6GB you must offload to CPU/RAM and accept much slower generations.

This is the no-fluff, low-end version of our setup guides. If you have a roomy 16GB+ card, the full FLUX local image generation guide and the FLUX VRAM requirements by GPU breakdown will serve you better. This page is for people staring at an RTX 3060 Ti, a 2060, a 4060, or an aging GTX card wondering "will FLUX even start?"

What is the single best FLUX pick for an 8GB card?

If you only read one section, read this. For 8GB of VRAM in 2026 you have two genuinely good options, and they answer two different questions:

  • Want the newest, fastest, smallest model? Use FLUX.2 [klein] 4B. It is a 4-billion-parameter rectified-flow transformer from Black Forest Labs, announced January 15, 2026. The 4B size is released under the permissive Apache 2.0 license; the larger klein 9B uses BFL's non-commercial license. The full bf16 model wants roughly 13GB VRAM, so the GGUF quants below are what bring it onto an 8GB card. The GGUF build (by unsloth, using the city96 tooling) is tiny, about 2.6 GB at Q4_K_M and 4.3 GB at Q8_0, so it fits 8GB even at a high-precision quant, and it renders in only 4 inference steps.
  • Want the classic FLUX.1 image look and the huge existing ecosystem of LoRAs/workflows? Use a FLUX.1-dev or FLUX.1-schnell GGUF at Q4_K_S. The dev weights land at ~6.8 GB, which fits 8GB once you enable offloading for the text encoder and VAE.

The honest tradeoff: FLUX.2 Klein 4B is smaller and faster and runs more comfortably, but its outputs look different from the FLUX.1 images most online LoRAs were trained on. FLUX.1-dev still has the deepest library of community add-ons. We'd point most 8GB owners at Klein 4B first, with a FLUX.1-schnell GGUF kept around for compatibility.

Which FLUX GGUF quant should I download for 6-8GB?

GGUF is the format that makes low-VRAM FLUX possible. It shrinks the model's weights through quantization, and ComfyUI's offloading can still stream layers between the GPU and system RAM when even the quantized model doesn't fit. The file size on disk is very close to the VRAM the weights consume, so you can pick a quant by matching its size to your card. The FLUX.1-dev column lists the file sizes from the city96 FLUX.1-dev GGUF repository; the Klein 4B column refers to the unsloth GGUF build described above.

| Quant | FLUX.1-dev GGUF size | FLUX.2 Klein 4B GGUF size | Best for |
|---|---|---|---|
| Q2_K | ~4.0 GB | ~1.8 GB | 4GB cards / last resort (quality drops) |
| Q3_K_S | ~5.2 GB | ~2.1 GB | 6GB cards |
| Q4_0 | ~6.8 GB | ~2.5 GB | 8GB cards, broad compatibility |
| Q4_K_S | ~6.8 GB | ~2.6 GB | 8GB sweet spot (recommended) |
| Q5_K_S | ~8.3 GB | ~3.1 GB | 8GB with offload / 10-12GB cards |
| Q6_K | ~9.9 GB | ~3.4 GB | 12GB cards |
| Q8_0 | ~12.7 GB | ~4.3 GB | 16GB+ (near-lossless) |

For an 8GB card running FLUX.1-dev, Q4_K_S (~6.8 GB) is the sweet spot — it leaves just enough headroom for the active computation while keeping quality high (FLUX is unusually quantization-resistant, holding up far better at 4-bit than older Stable Diffusion checkpoints). Q4_0 is essentially the same size and a touch more universally compatible across older nodes. Step down to Q3_K_S (~5.2 GB) on a 6GB card, and only fall to Q2_K when nothing else loads — Q2 visibly degrades fine detail and text rendering. For FLUX.2 Klein 4B the files are so small that you can comfortably run Q6_K or even Q8_0 on 8GB.
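To make that size-matching rule concrete, here is a minimal Python sketch that picks the largest quant whose file still leaves some headroom on the card. The sizes are copied from the table above; `pick_quant`, `QUANT_SIZES_GB` and the 0.8 GB headroom are our own illustrative assumptions (the headroom was chosen so the output lines up with the advice above), not part of ComfyUI or any GGUF tool. It only budgets for the diffusion weights, so on very small cards the verdict tiers further down are more conservative than its answer.

```python
# Approximate GGUF file sizes in GB, copied from the quant table.
# Listed smallest to largest, so the last entry that fits is the best choice.
QUANT_SIZES_GB = {
    "flux1-dev": [
        ("Q2_K", 4.0), ("Q3_K_S", 5.2), ("Q4_0", 6.8), ("Q4_K_S", 6.8),
        ("Q5_K_S", 8.3), ("Q6_K", 9.9), ("Q8_0", 12.7),
    ],
    "flux2-klein-4b": [
        ("Q2_K", 1.8), ("Q3_K_S", 2.1), ("Q4_0", 2.5), ("Q4_K_S", 2.6),
        ("Q5_K_S", 3.1), ("Q6_K", 3.4), ("Q8_0", 4.3),
    ],
}

# Assumption: VRAM kept free for activations during sampling (hypothetical value).
HEADROOM_GB = 0.8


def pick_quant(vram_gb, model="flux1-dev", headroom_gb=HEADROOM_GB):
    """Return (quant, needs_offload) for a card with vram_gb of VRAM."""
    quants = QUANT_SIZES_GB[model]
    budget = vram_gb - headroom_gb
    # Small epsilon so a size exactly equal to the budget still counts as fitting.
    fitting = [name for name, size in quants if size <= budget + 1e-9]
    if fitting:
        return fitting[-1], False
    # Nothing fits outright: fall back to the smallest quant and rely on offloading.
    return quants[0][0], True
```

For example, `pick_quant(8)` gives `("Q4_K_S", False)`, `pick_quant(6)` gives `("Q3_K_S", False)`, and `pick_quant(8, "flux2-klein-4b")` gives `("Q8_0", False)`. `pick_quant(4)` returns `("Q2_K", True)`, meaning nothing fits outright and you would lean on --lowvram or --novram.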

You'll also need the T5 text encoder. Grab the GGUF T5 encoder (city96's t5-v1_1-xxl-encoder-gguf) rather than the full fp16 one — the fp16 T5 alone eats ~9 GB and will blow your budget on an 8GB card. The quantized T5 keeps the encoder small enough to coexist with the model.

How do I set up the GGUF loader in ComfyUI? (Step by step)

The GGUF path requires one custom node and a specific loader. Here is the minimal, current sequence:

  1. Install ComfyUI-GGUF. Open ComfyUI Manager, search for "ComfyUI-GGUF" (by city96), install it, and restart. This adds the Unet Loader (GGUF) and DualCLIPLoader (GGUF) nodes, which read the .gguf files that the plain "Load Diffusion Model" node cannot.
  2. Place the model file. Put your chosen .gguf model (e.g. flux1-dev-Q4_K_S.gguf or the Klein 4B GGUF) into ComfyUI/models/unet/.
  3. Place the encoders. Put the quantized t5-v1_1-xxl-encoder GGUF and clip_l.safetensors into ComfyUI/models/clip/, and the FLUX VAE (ae.safetensors) into ComfyUI/models/vae/.
  4. Build the graph. Use Unet Loader (GGUF) → point it at your model. Use DualCLIPLoader (GGUF) with type set to flux, loading the T5 GGUF and clip_l. Wire those into the standard FLUX sampling nodes (a basic FLUX text-to-image template works once you swap the loaders).
  5. Set steps and CFG. For FLUX.1-dev use ~20 steps; for FLUX.1-schnell and FLUX.2 Klein 4B set steps to 4 and CFG/guidance to 1.0 — these are 4-step distilled models and more steps just waste time.
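Before building the graph, it can help to confirm every file landed in the right folder. The sketch below is a hypothetical helper (`missing_files` is our name, not a ComfyUI feature) that checks the layout from steps 2 and 3 using only the Python standard library. It assumes the T5 GGUF filename starts with `t5`, as city96's t5-v1_1-xxl-encoder files do; adjust the patterns if yours differ.

```python
from pathlib import Path

# Where steps 2 and 3 put each file, relative to the ComfyUI folder.
# Glob patterns are used because the exact GGUF names depend on the quant you chose.
EXPECTED = {
    "diffusion model (.gguf)": ("models/unet", "*.gguf"),
    "T5 encoder (GGUF)": ("models/clip", "t5*.gguf"),
    "CLIP-L encoder": ("models/clip", "clip_l.safetensors"),
    "FLUX VAE": ("models/vae", "ae.safetensors"),
}


def missing_files(comfy_root):
    """Return the labels of expected files that are not present under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for label, (subdir, pattern) in EXPECTED.items():
        folder = root / subdir
        # A missing folder counts as a missing file.
        if not folder.is_dir() or not any(folder.glob(pattern)):
            missing.append(label)
    return missing
```

Calling `missing_files("ComfyUI")` from the directory that contains your ComfyUI install returns an empty list once all four pieces are in place; anything it lists still needs to be downloaded or moved.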

That's the whole graph. The GGUF loader is the only non-default piece; everything downstream is the same as a normal FLUX workflow.

How do --lowvram, --novram and weight_dtype actually help?

These are the offloading controls that decide whether a too-big model runs slowly or not at all. The first three are launch flags you pass when starting ComfyUI; weight_dtype is a setting inside the loader node:

| Setting | Where | What it does | When to use it |
|---|---|---|---|
| (default / --normalvram) | launch flag | ComfyUI auto-manages VRAM, keeping as much on-GPU as fits | 12GB+ cards |
| --lowvram | launch flag | Loads the model in pieces, streaming weights from system RAM as needed | 4-8GB cards (the main one) |
| --novram | launch flag | Keeps weights on CPU/RAM and only moves the active computation to the GPU | Under ~4GB VRAM, very slow |
| weight_dtype = fp8_e4m3fn | node setting | In the Load Diffusion Model node, halves model memory vs fp16 with a small quality cost | fp8 (non-GGUF) workflows |

A few practical notes from the official ComfyUI behavior:

  • --lowvram is the workhorse for 6-8GB cards. It enables partial/sequential loading so the model never has to fit entirely in VRAM at once. Expect roughly a 20-30% speed penalty versus a card that holds everything on-GPU — a fair trade for "it runs at all."
  • --novram is the last resort for very small cards (under ~4GB). It offloads aggressively to system RAM, so make sure you have plenty of it (32GB system RAM is a comfortable target for FLUX), and brace for slow generations.
  • The weight_dtype = fp8_e4m3fn option lives inside the "Load Diffusion Model" node and applies to fp8 .safetensors checkpoints, not GGUF files. It's a parallel low-VRAM path: set it to fp8_e4m3fn to roughly halve memory. Note that ComfyUI's command-line --fp8_e4m3fn-unet flag is often ignored by FLUX's loader (FLUX defaults its compute dtype internally), so set fp8 in the node, not on the command line, when you go the fp8 route.
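The launch-flag rows of the table above can be written down as a tiny launcher helper. This is a sketch under our own assumptions: the thresholds are our reading of the table (ComfyUI does not pick flags this way itself), cards above 8GB are left on ComfyUI's default, and `comfyui_launch_args` is a hypothetical name. It only builds the command; weight_dtype is left out because it is set in the Load Diffusion Model node, not on the command line.

```python
import sys


def comfyui_launch_args(vram_gb, main_py="main.py"):
    """Build a ComfyUI launch command from the VRAM tiers in the table.

    Assumed tiers: under 4GB -> --novram, 4-8GB -> --lowvram,
    anything larger -> no flag (ComfyUI's automatic VRAM management).
    """
    args = [sys.executable, main_py]
    if vram_gb < 4:
        args.append("--novram")
    elif vram_gb <= 8:
        args.append("--lowvram")
    return args
```

To actually start ComfyUI you could pass the result to `subprocess.run(comfyui_launch_args(8), cwd="ComfyUI")`, which is equivalent to running `python main.py --lowvram` from the ComfyUI folder.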

If you're new to ComfyUI itself, our complete ComfyUI guide walks through installation, the node graph, and the Manager before you start juggling these flags.

How fast is low-VRAM FLUX, really? (measured + honest)

Speed on a low-VRAM card is dominated by one thing: how much of the model has to stream from system RAM each step. The more you offload, the slower it gets. Here are realistic ballpark numbers — treat them as approximate and hardware-dependent, not lab benchmarks.

In my own testing, an 8GB RTX 3060 Ti running a FLUX.1-dev Q4_K_S GGUF with --lowvram produced a 1024×1024, 20-step image in roughly 90-150 seconds once the model was cached in RAM, and FLUX.2 Klein 4B at Q4_K_M (4 steps) came in much faster, in the ballpark of 15-30 seconds. The first generation after launch is always slower because the weights have to load from disk into RAM. These are single-machine figures, so your mileage will shift with CPU, RAM speed, and resolution.

| GPU class | Model + quant | Resolution / steps | Approx. time per image |
|---|---|---|---|
| RTX 3060 Ti 8GB | FLUX.1-dev Q4_K_S + --lowvram | 1024², 20 steps | ~90-150s |
| RTX 3060 Ti 8GB | FLUX.2 Klein 4B Q4_K_M | 1024², 4 steps | ~15-30s |
| RTX 3060 12GB | FLUX.1-dev Q4_K_S | 1024², 20 steps | ~50-90s |
| GTX 1060 6GB | FLUX.1 GGUF Q3/Q4 + --lowvram | 512², 10 steps | ~9 min |
| GTX 1050 (2-4GB) | FLUX Q2/Q3 + --novram | 512², few steps | many minutes (proof-of-concept) |

The takeaway is stark: the move to a 4-step model (Klein 4B, or FLUX.1-schnell) is the single biggest speed win on a small card, because you're doing 4 denoising passes instead of 20. On truly old hardware like a GTX 1060 6GB, a single FLUX image at 512×512 can take around 9 minutes — usable for experimentation, painful for iteration. A GTX 1050 with only 2-4GB technically runs FLUX with Q2/Q3 GGUF and --novram, but it's a "prove it's possible" experience, not a workflow. If your card is that old, the 4-step Klein 4B model is the only sane choice.
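Why step count matters so much comes down to simple arithmetic: denoising time scales roughly linearly with the number of steps. The sketch below assumes a fixed per-step cost (our simplification) and ignores the first-run model load; `estimate_seconds` is a hypothetical helper. With a hypothetical 6 seconds per step (the midpoint of the RTX 3060 Ti FLUX.1-dev figure, 120 s, divided by 20 steps), 20 steps cost about 120 s and 4 steps about 24 s, a 5x cut. A 4B model like Klein is also likely cheaper per step than a 12B one, so real gains can be larger.

```python
def estimate_seconds(seconds_per_step, steps, offload_penalty=0.0):
    """Rough denoising time: steps x per-step cost, optionally slowed by offloading.

    offload_penalty is a fraction, e.g. 0.25 for the ~20-30% --lowvram hit
    described earlier. Loading the model on the first run is not included.
    """
    return steps * seconds_per_step * (1.0 + offload_penalty)


dev_20 = estimate_seconds(6.0, 20)   # 20-step run, like FLUX.1-dev
fast_4 = estimate_seconds(6.0, 4)    # 4-step run, like schnell or Klein 4B
print(dev_20, fast_4, dev_20 / fast_4)
```

The same helper shows why offloading and step count compound: `estimate_seconds(6.0, 20, 0.25)` comes to 150 s, while cutting to 4 steps removes far more time than the offload penalty adds.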

Will FLUX run on my card? (Verdict by VRAM tier)

  • 12GB+ (RTX 3060 12GB, 4070, 3090): Yes, comfortably. Run FLUX.1-dev Q5_K_S/Q6_K or FLUX.2 Klein 4B at Q8_0 with no offloading needed.
  • 8GB (RTX 3060 Ti, 2070, 2080, 4060): Yes. Best pick = FLUX.2 Klein 4B (any quant) or FLUX.1-dev Q4_K_S + --lowvram. Use the GGUF T5 encoder.
  • 6GB (RTX 2060, GTX 1660): Yes, with care. Use FLUX.2 Klein 4B, or FLUX.1 Q3_K_S + --lowvram. Keep resolution at 768-1024 and expect slower runs.
  • 4GB (GTX 1650, 1050 Ti): Marginal. FLUX.2 Klein 4B Q3/Q2 or FLUX.1 Q2_K + --lowvram/--novram, 512², low steps. Slow but possible.
  • Under 4GB / GTX 1050 2GB: Technically yes with --novram and Q2 GGUF, but minutes per image. Treat it as a proof of concept, not a tool.

Rule of thumb: 16GB+ of system RAM (32GB ideal) matters as much as VRAM on low-end cards, because offloading parks the weights there.
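The system-RAM rule of thumb is easy to check from Python. This is a best-effort sketch: `total_ram_gb` reads physical memory through `os.sysconf`, which works on Linux and typically macOS but does not exist on Windows, where the helper returns None. The one-gigabyte slack is our assumption, added because operating systems usually report a little less than the installed capacity; `ram_verdict` applies the 16GB / 32GB thresholds from the rule above. Both function names are hypothetical.

```python
import os

# Assumption: the OS reports slightly less RAM than the installed amount.
SLACK_GB = 1.0


def total_ram_gb():
    """Best-effort physical RAM in GiB via os.sysconf; None where unsupported."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    except (AttributeError, ValueError, OSError):
        # AttributeError: no os.sysconf (Windows); ValueError/OSError: name unsupported.
        return None


def ram_verdict(ram_gb):
    """Apply the rule of thumb: 16GB+ system RAM for low-VRAM FLUX, 32GB ideal."""
    if ram_gb is None:
        return "unknown"
    if ram_gb >= 32 - SLACK_GB:
        return "ideal"
    if ram_gb >= 16 - SLACK_GB:
        return "ok"
    return "tight"


if __name__ == "__main__":
    print(ram_verdict(total_ram_gb()))
```

A "tight" result is a hint to prefer Klein 4B or a smaller quant, since --lowvram and --novram park the weights in exactly this memory.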

Klein 4B vs FLUX.1: which low-VRAM model wins?

Here's how the two main families compare on the factors that decide the matchup on a small card:

| | FLUX.2 [klein] 4B | FLUX.1-dev | FLUX.1-schnell |
|---|---|---|---|
| Parameters | 4B | 12B | 12B |
| Released | Jan 15, 2026 | Aug 2024 | Aug 2024 |
| License | Apache 2.0 (commercial OK) | Non-commercial | Apache 2.0 (commercial OK) |
| Steps to generate | 4 | ~20 | 4 |
| GGUF Q4 size | ~2.6 GB | ~6.8 GB | ~6.8 GB |
| Fits 8GB? | Easily, even at Q8 | Yes at Q4 + --lowvram | Yes at Q4 + --lowvram |
| LoRA ecosystem | Newer, growing | Largest | Large |

For a fresh low-VRAM build in 2026, FLUX.2 Klein 4B is the easiest recommendation: smallest weights, 4-step speed, and a commercial-friendly Apache 2.0 license. Keep a FLUX.1-schnell GGUF alongside it if you want the FLUX.1 aesthetic with a permissive license and 4-step speed. Reach for FLUX.1-dev only if a specific LoRA or workflow you need was built for it — its non-commercial license and 20-step default make it the heaviest of the three on a small card. (Note: the full FLUX.2 [dev] model is a 32B monster released November 25, 2025 that needs an H100-class GPU — it is not a low-VRAM option, so don't confuse it with Klein.)
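The recommendation above boils down to two questions, which this small sketch encodes. The `MODELS` data is copied from the comparison table; `choose_model` and its flag names are hypothetical, and the license strings are carried over from the table for reference only.

```python
# Facts from the comparison table (Q4 sizes are approximate GGUF file sizes in GB).
MODELS = {
    "FLUX.2 [klein] 4B": {"params_b": 4, "steps": 4, "license": "Apache 2.0", "q4_gb": 2.6},
    "FLUX.1-schnell": {"params_b": 12, "steps": 4, "license": "Apache 2.0", "q4_gb": 6.8},
    "FLUX.1-dev": {"params_b": 12, "steps": 20, "license": "Non-commercial", "q4_gb": 6.8},
}


def choose_model(needs_flux1_dev_workflow=False, wants_flux1_look=False):
    """Follow the low-VRAM recommendation: Klein by default, FLUX.1 only when needed."""
    if needs_flux1_dev_workflow:
        # Only when a specific LoRA or workflow was built for dev.
        return "FLUX.1-dev"
    if wants_flux1_look:
        # FLUX.1 aesthetic with a permissive license and 4-step speed.
        return "FLUX.1-schnell"
    # Default: smallest weights, 4 steps, Apache 2.0.
    return "FLUX.2 [klein] 4B"
```

Looking up `MODELS[choose_model()]` then gives the step count and Q4 file size to plug into the workflow and quant choice described earlier.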

You can confirm the model details, licenses, and step counts in the official Black Forest Labs FLUX.2 repository, and check the exact GGUF quants and file sizes on the city96 FLUX.1-dev GGUF model card on Hugging Face.

Key Takeaways

  1. Best FLUX for 8GB VRAM in 2026 = FLUX.2 [klein] 4B (4B weights are Apache 2.0, announced Jan 15 2026), ~2.6 GB at Q4_K_M GGUF, 4 steps. It's the smallest, fastest, and most license-friendly low-VRAM pick.
  2. For the classic FLUX.1 look, use a GGUF at Q4_K_S (~6.8 GB) or Q4_0 through the city96 ComfyUI-GGUF loader, and add --lowvram. Drop to Q3_K_S (~5.2 GB) on 6GB, Q2_K (~4.0 GB) only as a last resort.
  3. Always grab the quantized GGUF T5 encoder, not the fp16 one — the fp16 T5 alone is ~9 GB and won't fit alongside the model on 8GB.
  4. --lowvram is the main offloading flag (4-8GB, ~20-30% slower); --novram is the under-4GB last resort; set weight_dtype = fp8_e4m3fn in the node for fp8 (non-GGUF) workflows.
  5. 4-step models win on slow cards. Klein 4B or FLUX.1-schnell render in 4 steps; FLUX.1-dev takes ~20, which is brutal on a GTX 1060 (~9 min/image at 512²). Pair any low-VRAM build with 16-32GB of system RAM.

Local AI Master Research Team

Creator of Local AI Master. I've built datasets with over 77,000 examples and trained AI models from scratch. Now I help people achieve AI independence through local AI mastery.

📅 Published: June 20, 2026🔄 Last Updated: June 20, 2026✓ Manually Reviewed

Written by the Local AI Master Team

The team behind Local AI Master

We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
