FLUX.1 [dev], a 12B-parameter model, needs roughly 24 GB of VRAM at full FP16, about 12 GB at FP8, and as little as 6-8 GB with a GGUF Q4 quant, which is why an 8GB RTX 3060/4060 can still run it. The best file to download depends on your VRAM: Q4_K_S/Q4_0 GGUF for 8GB, Q5/Q6 GGUF for 12GB, FP8 (flux1-dev-fp8.safetensors, ~11.9 GB) for 16GB, and the full FP16 flux1-dev.safetensors (~23.8 GB) for 24GB. The newer FLUX.2 [dev] is a 32B model built for data-center GPUs: at full BF16 it exceeds a single 80GB H100, FP8 needs ~32 GB, and a Q4 GGUF needs ~19 GB with the text encoder offloaded. For most local GPUs the right answer is still FLUX.1, or the new FLUX.2 [klein] 4B.

This guide gives you a definitive "what runs on my card" table, the exact GGUF/FP8 file to grab per VRAM tier with file sizes in GB, the ComfyUI low-VRAM flags that make it fit, and the generation-time tradeoffs you accept when you quantize down.

What is the best FLUX model for 8GB, 12GB, and 16GB VRAM?

Here is the direct verdict before the detail. Each row pairs a FLUX variant with the specific quantized file most people actually run on that card:

Your VRAM	Best FLUX model + file	Why
8 GB (RTX 3060 Ti / 4060 / 3070)	FLUX.1 [dev] GGUF Q4 (flux1-dev-Q4_K_S.gguf, ~6.8 GB) + FP8 T5 encoder	Smallest dev quant that holds quality; needs --lowvram
12 GB (RTX 3060 12GB / 4070)	FLUX.1 [dev] GGUF Q5_K_S (~8.3 GB) or Q6_K (~9.9 GB)	Sweet spot: near-FP8 quality, fits with headroom for a LoRA
16 GB (RTX 4060 Ti / 5060 Ti / 4070 Ti S)	FLUX.1 [dev] FP8 (flux1-dev-fp8.safetensors, ~11.9 GB)	Nearly FP16 quality, half the VRAM, simple single file
24 GB (RTX 3090 / 4090 / 5090)	FLUX.1 [dev] FP16 (flux1-dev.safetensors, ~23.8 GB)	Maximum quality, full LoRA/ControlNet ecosystem
12 GB+, newest model	FLUX.2 [klein] 4B (Apache 2.0, ~13 GB FP16)	Sub-second, runs on 12GB+, the only easy FLUX.2 locally

If you want speed over peak quality on any tier, FLUX.1 [schnell] (Apache 2.0, generates in 1-4 steps) uses the same VRAM as dev but finishes far faster. The catch is that schnell trades some prompt fidelity and fine detail for that speed.

FLUX.1 [dev] VRAM requirements by precision (the core table)

This is the core reference table. FLUX.1 [dev] has ~12 billion parameters; the VRAM you need is the model weights plus the T5-XXL text encoder plus a bit of working memory. The figures below are practical totals people hit in ComfyUI, and they assume an FP8 T5 encoder on the tight tiers (a full-precision T5 is ~9.2 GB by itself).

Precision / quant	Model file	File size	Practical VRAM	Quality
FP16 (full)	flux1-dev.safetensors	~23.8 GB	~24 GB (up to 33 GB w/ overhead)	Reference
FP8	flux1-dev-fp8.safetensors	~11.9 GB	~12-16 GB	~99% of FP16
GGUF Q8_0	flux1-dev-Q8_0.gguf	~12.7 GB	~12-14 GB	Excellent
GGUF Q6_K	flux1-dev-Q6_K.gguf	~9.9 GB	~10-12 GB	Very good
GGUF Q5_K_S	flux1-dev-Q5_K_S.gguf	~8.3 GB	~8-10 GB	Very good
GGUF Q4_K_S	flux1-dev-Q4_K_S.gguf	~6.8 GB	~6-8 GB	Good
NF4 (4-bit)	flux1-dev-bnb-nf4-v2.safetensors	~12 GB file (bundles T5+CLIP+VAE)	~6-8 GB	Good

A few honest notes. FP8 is the genuine sweet spot: it is a single safetensors file, needs no extra GGUF node, and produces images visually indistinguishable from FP16 in most prompts while halving memory. GGUF Q8_0 is marginally larger than FP8 on disk (~12.7 GB vs ~11.9 GB) and slightly higher fidelity, but needs the ComfyUI-GGUF node. Below Q4 (Q3/Q2) the model starts to lose anatomy and text rendering, so Q4 is the practical floor for FLUX.1 [dev]. You can confirm every GGUF size on the city96/FLUX.1-dev-gguf model card.
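The file sizes in the table follow closely from parameter count times bits per weight. Here is a minimal sketch of that arithmetic, assuming the ~12 billion parameters quoted above. The GGUF bits-per-weight values are nominal averages for each quant type and are an assumption here, not figures taken from the model card; the function name is illustrative.

```python
# Nominal bits per weight for each format.
# The GGUF entries are approximate averages (an assumption for illustration).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "FP8": 8.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q5_K_S": 5.5,
    "Q4_K_S": 4.5,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate weight size in GB (10^9 bytes) for a parameter count and quant."""
    bits = BITS_PER_WEIGHT[quant]
    # params (in billions) * bits / 8 gives billions of bytes, i.e. GB.
    return params_billion * bits / 8.0


# Print an estimate for every format at 12B parameters.
for quant in BITS_PER_WEIGHT:
    print(f"{quant:7s} ~{weights_gb(12, quant):.2f} GB")
```

The estimates land within a few tenths of a GB of the listed file sizes. They cover weights only; the text encoder and working memory come on top.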

FLUX VRAM by GPU card — what actually fits

Now map that to real hardware. This is the "will it run on my card" reference, including the newer FLUX.2 family so you can see why most people stay on FLUX.1.

GPU	VRAM	FLUX.1 dev (recommended file)	FLUX.2 klein 4B	FLUX.2 dev (32B)
RTX 3060 8GB / 3060 Ti / 4060 / 3070	8 GB	Q4_K_S GGUF + --lowvram	Quantized only (~13 GB at FP16)	No
RTX 3060 12GB / 4070	12 GB	Q5_K_S or Q6_K GGUF	Yes (FP16 / GGUF)	No
RTX 4060 Ti / 5060 Ti	16 GB	FP8 (full single file)	Yes, comfortably	No
RTX 3090 / 4090 / 5090	24 GB	FP16 full quality	Yes	Only Q4 GGUF, text encoder on CPU
A6000 / H100	48-80 GB	FP16 + big batches	Yes	FP8 (~32 GB) / FP16
Apple Silicon (M-series)	Unified (16-128 GB)	GGUF Q5-Q8 via ComfyUI/Draw Things	Yes (32GB+ recommended)	Quantized only

The RTX 5060 Ti 16GB deserves a callout because it is the current value pick for FLUX: 16 GB is exactly enough for the FP8 single-file workflow without any low-VRAM gymnastics. We break that card down in detail in the RTX 5060 Ti 16GB for local AI guide. For the wider GPU landscape, see best GPUs for AI.

Which exact file should you download per VRAM tier?

People waste hours downloading the wrong 12 GB file. Here is the precise download recipe per tier. Every FLUX setup also needs three shared support files in addition to the main model: the CLIP-L encoder (clip_l.safetensors), the T5-XXL encoder, and the VAE (ae.safetensors).

8 GB GPU: main model = flux1-dev-Q4_K_S.gguf (~6.8 GB) into ComfyUI/models/unet/; text encoder = t5xxl_fp8_e4m3fn.safetensors (the FP8 T5, ~4.6 GB — the FP16 T5 alone would not fit). Install the ComfyUI-GGUF custom node and load the model with its "Unet Loader (GGUF)" node.
12 GB GPU: main model = flux1-dev-Q5_K_S.gguf (~8.3 GB) or Q6_K (~9.9 GB) into models/unet/; you can use the FP8 T5 to leave room for a longer prompt and a LoRA.
16 GB GPU: main model = flux1-dev-fp8.safetensors (~11.9 GB) into models/diffusion_models/ (or models/unet/); no GGUF node needed; FP16 T5 works but FP8 T5 is safer headroom.
24 GB GPU: main model = flux1-dev.safetensors (~23.8 GB) into models/diffusion_models/ with the FP16 t5xxl_fp16.safetensors for full reference quality.
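The recipe above can be sketched as a small lookup. This is only an illustration of the tiers in this guide: the function name and dictionary keys are hypothetical, and where the guide offers two options for a tier (Q5_K_S or Q6_K on 12 GB, FP16 or FP8 T5 on 16 GB) the sketch picks the one with more headroom.

```python
def pick_flux1_dev_files(vram_gb: float) -> dict:
    """Return the FLUX.1 [dev] files this guide suggests for a VRAM tier."""
    if vram_gb >= 24:
        return {
            "model": "flux1-dev.safetensors",
            "folder": "models/diffusion_models",
            "t5": "t5xxl_fp16.safetensors",
            "needs_gguf_node": False,
        }
    if vram_gb >= 16:
        return {
            "model": "flux1-dev-fp8.safetensors",
            "folder": "models/diffusion_models",
            "t5": "t5xxl_fp8_e4m3fn.safetensors",  # FP8 T5 for safer headroom
            "needs_gguf_node": False,
        }
    if vram_gb >= 12:
        return {
            "model": "flux1-dev-Q5_K_S.gguf",
            "folder": "models/unet",
            "t5": "t5xxl_fp8_e4m3fn.safetensors",
            "needs_gguf_node": True,
        }
    if vram_gb >= 8:
        return {
            "model": "flux1-dev-Q4_K_S.gguf",
            "folder": "models/unet",
            "t5": "t5xxl_fp8_e4m3fn.safetensors",
            "needs_gguf_node": True,
        }
    # The tiers in this guide start at 8 GB; smaller cards are out of scope.
    raise ValueError("This guide's tiers start at 8 GB of VRAM")


print(pick_flux1_dev_files(12))
```

Every tier still needs clip_l.safetensors and ae.safetensors alongside the files returned here.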

The official ComfyUI FLUX example pages and the black-forest-labs/FLUX.1-dev model card document the exact filenames and folders. For a full step-by-step ComfyUI walkthrough including the workflow graph, see our complete ComfyUI guide.

ComfyUI low-VRAM flags that make FLUX fit

If FLUX hits out-of-memory errors, these are the levers, in order of how much they help:

Quantize the model — drop from FP16 to FP8 or a GGUF Q-level. This is the single biggest win and the first thing to try.
Use the FP8 T5 encoder (t5xxl_fp8_e4m3fn.safetensors, ~4.6 GB) instead of the FP16 T5 (~9.2 GB). That alone frees roughly 4-5 GB.
Set weight_dtype to fp8_e4m3fn in the "Load Diffusion Model" node — this casts the diffusion weights to FP8 on the fly and roughly halves their memory at a tiny quality cost.
Launch ComfyUI with low-VRAM flags. Start with --lowvram (offloads to system RAM as needed); use --novram only as a last resort on tiny cards because it is much slower. Example: python main.py --lowvram.
Drop resolution to 768×768 or 512×512 while testing; 1024×1024 costs the most working memory.
Keep batch size at 1 and close other GPU apps (browser tabs, Discord) before generating.

For GGUF models, a useful rule of thumb is that the file size on disk is close to the VRAM the weights occupy, so you can pre-screen a file against your card before downloading. Sizing context and KV-style overhead for any model is covered in our broader VRAM requirements guide.
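That rule of thumb can be turned into a quick pre-screen before you download anything. The sketch below is an assumption-laden illustration: the 1.5 GB working-memory allowance is a hypothetical placeholder rather than a measured figure, and the helper names are made up for this example. Only the --lowvram flag and the python main.py launch line come from the steps above.

```python
def vram_margin_gb(file_gb: float, vram_gb: float, working_gb: float = 1.5) -> float:
    """VRAM left after the weights plus an assumed working-memory allowance.

    A negative result suggests the model will not fully fit and that
    offloading to system RAM is likely.
    """
    return vram_gb - file_gb - working_gb


def suggest_launch_args(file_gb: float, vram_gb: float) -> list:
    """Build a ComfyUI launch command, adding --lowvram when the fit is tight."""
    args = ["python", "main.py"]
    if vram_margin_gb(file_gb, vram_gb) < 0:
        args.append("--lowvram")
    return args


# Q4_K_S GGUF (~6.8 GB) on an 8 GB card vs. FP8 (~11.9 GB) on a 16 GB card.
print(" ".join(suggest_launch_args(6.8, 8)))
print(" ".join(suggest_launch_args(11.9, 16)))
```

With these placeholder numbers the 8 GB case comes out slightly negative, which matches the guide's advice to pair Q4_K_S with --lowvram on that tier.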

How quantization affects generation time

Lower quants save VRAM but they are not free on speed — and the relationship is not always intuitive. FP16 and FP8 run at full GPU tensor-core speed. GGUF quants add a small dequantization step per layer, so a Q4 GGUF that fits comfortably can actually be a touch slower per step than FP8 even though it uses less memory. The real cliff is offloading: the moment ComfyUI spills layers to system RAM (which --lowvram does when the model does not fully fit), generation time can multiply several times over.

In my own testing, FLUX.1 [dev] at full FP16 on an RTX 3090 (24GB) generates a 1024×1024 image in roughly 10-18 seconds at 20 steps, and FP8 lands in a similar range. A GGUF Q4 on a tight 8GB card with --lowvram offloading stretched to well over a minute per image. These are approximate single-machine numbers at 20 steps, not a controlled benchmark, and your sampler and step count will move them. The takeaway: pick the highest quant that fully fits your VRAM, because fitting fully matters far more than the quant label.

Setup	Quant	Approx time / 1024px (20 steps)	Note
RTX 3090/4090 24GB	FP16	~10-18 s	Reference speed
RTX 5060 Ti 16GB	FP8	~14-22 s	Fits as a single file
RTX 3060 12GB	Q5_K_S GGUF	~25-40 s	Small dequant overhead
RTX 3060 8GB	Q4_K_S + --lowvram	~60 s+	Offloading is the bottleneck

Treat these as ballpark and hardware-dependent; FLUX.1 [schnell] cuts all of them dramatically because it needs only 1-4 steps instead of 20.
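To get a rough feel for the schnell speedup, you can scale a 20-step time by the step count. This is a back-of-the-envelope sketch that assumes time grows linearly with steps and that a schnell step costs about the same as a dev step; it ignores fixed costs such as text encoding and VAE decode, so real numbers will be somewhat higher. The function name is illustrative.

```python
def scale_time_by_steps(seconds: float, from_steps: int, to_steps: int) -> float:
    """Linearly rescale a generation time from one step count to another."""
    return seconds * to_steps / from_steps


# The ~10-18 s range quoted for FP16 at 20 steps, rescaled to 4 steps.
low = scale_time_by_steps(10, 20, 4)
high = scale_time_by_steps(18, 20, 4)
print(f"~{low:.1f}-{high:.1f} s at 4 steps (linear estimate)")
```

Under those assumptions, the 10-18 s range at 20 steps becomes about 2-3.6 s at 4 steps.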

FLUX.2 [dev] and [klein] — the new VRAM picture

FLUX.2 changes the math. FLUX.2 [dev] (released November 25, 2025) is a 32B-parameter model, far larger than FLUX.1's 12B. Its weights are roughly 64 GB at BF16; combined with text-encoder and activation overhead, that does not fit on a single 80GB H100 unoptimized (Black Forest Labs ships a sequential-offload path for H100). FP8 brings the weights to roughly 32 GB, and a Q4 GGUF compresses them to about 19 GB, which a 24GB RTX 4090 can run only with the text encoder offloaded to the CPU. The model is non-commercial and built for H100-class GPUs, so it is not a practical local pick for most readers.
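As a sanity check, the quoted FLUX.2 [dev] sizes can be converted back into implied bits per weight. This is a small sketch using only the figures in the paragraph above; the function name is illustrative, and the result covers weights only.

```python
def implied_bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Bits per weight implied by a weight size (GB) and a parameter count."""
    # GB * 8 gives billions of bits; dividing by billions of params gives bits/param.
    return size_gb * 8.0 / params_billion


for label, size_gb in [("BF16", 64), ("FP8", 32), ("Q4 GGUF", 19)]:
    print(f"{label:8s} {implied_bits_per_weight(size_gb, 32):.2f} bits/weight")
```

The three sizes work out to 16, 8, and 4.75 bits per weight, so the quoted numbers are consistent with a 32B model at those precisions.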

FLUX.2 [klein] (released January 15, 2026) is the consumer answer. It ships as a 4B model under an Apache 2.0 license (free commercial use, ~13 GB at FP16, runs on a 12GB+ card and on ~8 GB when quantized) and a 9B model under a non-commercial license (~29 GB at FP16, runs on 16GB quantized). The 4B is step-distilled to about 4 inference steps and generates in roughly a second on a capable GPU — the first FLUX.2 weight most people can actually run locally.

FLUX.2 variant	Parameters	License	VRAM (FP16)	Local-friendly?
FLUX.2 [dev]	32B	Non-commercial	~64 GB BF16, >80GB w/ overhead (FP8 ~32 GB, Q4 ~19 GB)	Data-center only
FLUX.2 [klein] 9B	9B	Non-commercial	~29 GB (16 GB quantized)	16-24 GB cards
FLUX.2 [klein] 4B	4B	Apache 2.0	~13 GB	Yes (12GB+)

For now, FLUX.1 [dev]/[schnell] still has the deepest ComfyUI tooling, LoRA library, and ControlNet support, so many local users stay on it even after FLUX.2 launched. A full setup walkthrough for both generations is in our run FLUX.1 locally guide.

FLUX on Apple Silicon (unified memory)

Apple Silicon Macs do not have separate VRAM — the GPU shares the system's unified memory, so a 32GB Mac can load models that would never fit on a 24GB discrete card. FLUX.1 [dev] runs on M-series Macs through ComfyUI (with MPS acceleration) or the Mac-native Draw Things app, which uses Metal and is generally a bit faster than ComfyUI on the same chip. Use a GGUF Q5-Q8 quant to keep memory comfortable; 16GB unified is the practical minimum and 32GB+ is recommended for relaxed use. Expect roughly 1-3 minutes per 1024×1024 image on M-series chips — slower than a fast NVIDIA card, but silent, low-power, and entirely local.

Key Takeaways

FLUX.1 [dev] needs ~24 GB at FP16, ~12 GB at FP8, and 6-8 GB at GGUF Q4 — which is why even an 8GB RTX 3060/4060 can run it with the right file.
Match the file to your card: Q4_K_S GGUF (~6.8 GB) for 8GB, Q5/Q6 GGUF for 12GB, FP8 (~11.9 GB) for 16GB, FP16 (~23.8 GB) for 24GB.
FP8 is the sweet spot on 16GB — a single safetensors file, no extra node, ~99% of FP16 quality at half the VRAM.
The FP8 T5 encoder frees ~4-5 GB and is the second-biggest VRAM lever after quantizing the model; --lowvram is the offload safety net but it slows generation.
FLUX.2 [dev] (32B) is data-center-class (~64 GB BF16, more than a single 80GB H100 with overhead; ~32 GB FP8); for new-model local use reach for FLUX.2 [klein] 4B (Apache 2.0, ~13 GB) or stay on FLUX.1.

Next Steps

Setting FLUX up from scratch? Follow our Run FLUX.1 Locally guide with the full file list and 5-minute ComfyUI install.
Want the workflow graph and node-by-node setup? See the complete ComfyUI guide.
Sizing other models against your GPU? Read the broader VRAM requirements guide.
Shopping for a card for FLUX? Compare the value pick in RTX 5060 Ti 16GB for local AI and the full lineup in best GPUs for AI.

FLUX VRAM Requirements by GPU (2026): 8GB to 24GB Guide

Written by the Local AI Master Team
