Alibaba Qwen · Open-Weight Image Model
Qwen-Image Local: 20B Text-Rendering Model Setup
Qwen-Image is Alibaba's 20-billion-parameter MMDiT text-to-image model, released August 4, 2025 under Apache 2.0. Its standout trait is complex text rendering: readable, correctly spelled multi-line English and Chinese text inside generated images, an area where it is generally stronger than FLUX and SDXL. You can run it locally today in ComfyUI with roughly 40GB VRAM for full BF16, about 16GB for FP8, or as little as 8GB VRAM (with 16GB system RAM) using GGUF or Nunchaku 4-bit plus a 4-step Lightning LoRA.
Open weights, runs offline. Qwen-Image ships on Hugging Face under Apache 2.0, so it works in ComfyUI with no API key and no per-image cost. New to local image generation? Start with our local image generation guide.
Key takeaways
- 20B MMDiT, Apache 2.0: a genuinely large, commercially usable open image model from Alibaba's Qwen team.
- Text rendering is the reason to use it: posters, signage, UI mockups, and logos with real words, in English and Chinese.
- Scales to your GPU: ~40GB BF16 → ~16GB FP8 → ~8GB via GGUF / Nunchaku + Lightning LoRA.
- Three ComfyUI paths: native (FP8/BF16), GGUF (low-VRAM), and Nunchaku 4-bit (fastest on small cards).
- Qwen-Image-Edit-2509 adds multi-image editing (1-3 inputs) and native ControlNet (depth, edge, keypoint).
TL;DR
Pick Qwen-Image when you need legible text inside the image — the one task most diffusion models still fumble. It is a 20B MMDiT model (Alibaba Qwen, Aug 4 2025, Apache 2.0) that renders accurate multi-line English and Chinese typography, and the Qwen family stays among the strongest open image models: its later Qwen-Image-2512 refresh (Dec 31 2025) ranks as one of the top open-source text-to-image models on the community lmarena.ai Image Arena (per Alibaba and lmarena.ai). For everything else — general aesthetics, photorealism, speed — FLUX or SDXL are still fine choices.
The fastest way to run it locally is to install ComfyUI and choose a format that fits your card: FP8 (~16GB VRAM) on a 4080/4090-class GPU, or a GGUF / Nunchaku 4-bit build with a 4-step Lightning LoRA if you only have 8GB VRAM and 16GB system RAM.
Specs at a glance
| Attribute | Qwen-Image |
|---|---|
| Developer | Alibaba (Qwen Team) |
| Release date | August 4, 2025 |
| Parameters | 20B |
| Architecture | MMDiT (Multimodal Diffusion Transformer) |
| Standout strength | Complex text rendering (EN + CN) |
| License | Apache 2.0 |
| Open weights? | Yes (Hugging Face / ModelScope) |
| Editing variant | Qwen-Image-Edit (Aug 18, 2025) · Edit-2509 (Sep 22, 2025) |
| Local runtimes | ComfyUI (native / GGUF / Nunchaku), Diffusers, DiffSynth-Studio |
| Min. local VRAM | ~8GB (GGUF / Nunchaku 4-bit + 16GB system RAM) |
Sources: QwenLM/Qwen-Image (GitHub) and the official Qwen-Image announcement. File sizes and VRAM figures vary by build and quantization — verify the specific checkpoint you download.
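The checkpoint sizes quoted in this guide follow from simple arithmetic: parameter count times bytes per parameter. A rough sketch, assuming the rounded 20B figure and ignoring quantization metadata (the `weights_gb` helper is ours, purely for illustration):

```python
# Back-of-envelope weight sizes: parameters x bytes per parameter.
# 20e9 is the article's rounded parameter count; real checkpoints differ
# slightly (metadata, tensors kept at higher precision).
PARAMS = 20e9

BYTES_PER_PARAM = {
    "BF16": 2.0,   # 16 bits per weight
    "FP8": 1.0,    # 8 bits per weight
    "4-bit": 0.5,  # 4 bits per weight, ignoring scales and zero-points
}


def weights_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for fmt, width in BYTES_PER_PARAM.items():
    print(f"{fmt}: ~{weights_gb(PARAMS, width):.0f} GB")
```

This gives about 40 GB, 20 GB, and 10 GB. The published files run slightly larger (~40.9GB BF16, ~20.4GB FP8), and GGUF 4-bit builds usually land above the naive estimate because they store scaling data alongside the weights.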
VRAM tiers: which build fits your GPU
This is the table that actually decides your setup. Qwen-Image is a big model, but quantization and the Lightning LoRA bring it within reach of 8GB cards. The "Practical VRAM" figures cover the diffusion model weights plus typical ComfyUI overhead. The text encoder and VAE need roughly another 8-10GB, less on the quantized paths, which can shrink them too.
| Format | Model file size | Practical VRAM | Best for |
|---|---|---|---|
| BF16 (full) | ~40.9 GB | ~40GB+ (A100 / H100 / dual 24GB) | Max quality, research, batch rendering |
| FP8 | ~20.4 GB | ~16-20GB (RTX 4080 / 4090) | The sweet spot — near-full quality, no GGUF |
| GGUF Q4_K_S | ~12-13 GB | ~12-13GB (RTX 3060 12GB / 4070; expect some offload to system RAM on 12GB cards) | Mid-range cards, good quality/size balance |
| GGUF Q2-Q3 / Nunchaku 4-bit | ~7-9 GB | 8GB VRAM + 16GB system RAM | Budget GPUs; pair with 4-step Lightning LoRA |
Sources: ComfyUI Wiki Qwen-Image guide and the official ComfyUI Qwen-Image tutorial. BF16 weights ~40.9GB and FP8 ~20.4GB are the published checkpoint sizes; the Nunchaku 4-bit Lightning build documents an 8GB-VRAM / 16GB-RAM minimum. Add the text encoder (FP16 ~16GB, FP8 ~9GB) unless you use a quantized encoder too.
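The table above can be read as a simple lookup from available memory to a build. A minimal sketch, assuming thresholds rounded from the table; `pick_build` and `TIERS` are illustrative names, and only the 8GB VRAM / 16GB RAM floor is a documented minimum:

```python
from typing import Optional

# (min VRAM in GB, min system RAM in GB, build label), largest tier first.
# Thresholds are rounded from the table and are assumptions, not official
# requirements. The source gives a RAM minimum only for the 8 GB tier,
# so the other tiers use 0 here.
TIERS = [
    (40, 0, "BF16 (full)"),
    (16, 0, "FP8"),
    (12, 0, "GGUF Q4_K_S"),
    (8, 16, "GGUF Q2-Q3 / Nunchaku 4-bit + Lightning LoRA"),
]


def pick_build(vram_gb: float, ram_gb: float) -> Optional[str]:
    """Return the largest build whose memory floor is met, or None."""
    for min_vram, min_ram, label in TIERS:
        if vram_gb >= min_vram and ram_gb >= min_ram:
            return label
    return None


print(pick_build(24, 64))  # a 24 GB card lands in the FP8 tier
print(pick_build(8, 16))   # the documented minimum for the 4-bit path
```

A result of `None` means the machine falls below every tier listed here, for example an 8GB card paired with only 8GB of system RAM.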
Compare with FLUX first. If you are budgeting a GPU specifically for image work, our best GPU for image generation guide maps VRAM tiers to real cards — the same tiers apply directly to Qwen-Image.
The three ComfyUI workflows
ComfyUI has native Qwen-Image support, and the community adds two more low-VRAM paths. Pick one based on the VRAM tier above:
1. Native (FP8 / BF16)
The official path. Drop the diffusion model into ComfyUI/models/diffusion_models, the text encoder and VAE into their folders, and load the built-in Qwen-Image template. Best quality; needs ~16GB+ VRAM for the FP8 build.
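Before loading the template, it can help to confirm that each file landed in the right folder. The sketch below assumes the text encoder and VAE go in `models/text_encoders` and `models/vae` (check your ComfyUI install), and the file names are placeholders, not the real checkpoint names:

```python
from pathlib import Path


def missing_model_files(comfy_root: Path, expected: dict[str, str]) -> list[str]:
    """Return paths (relative to comfy_root) of expected files not on disk.

    `expected` maps a subfolder of <comfy_root>/models to a file name.
    """
    missing = []
    for subfolder, filename in expected.items():
        path = comfy_root / "models" / subfolder / filename
        if not path.is_file():
            missing.append(str(path.relative_to(comfy_root)))
    return missing


# Placeholder file names: replace them with the files you actually downloaded.
EXPECTED = {
    "diffusion_models": "qwen_image_fp8.safetensors",
    "text_encoders": "qwen_text_encoder.safetensors",
    "vae": "qwen_image_vae.safetensors",
}

if __name__ == "__main__":
    # Assumes the script runs from the directory that contains ComfyUI/.
    for rel in missing_model_files(Path("ComfyUI"), EXPECTED):
        print("missing:", rel)
```

An empty result means all three files are where the native workflow expects them.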
2. GGUF (low-VRAM)
Install the ComfyUI-GGUF custom node and load a quantized .gguf checkpoint (Q2 through Q8). Q4_K_S is the usual quality/size pick for 12-13GB cards; smaller quants reach 8GB. Bigger files mean better quality but more VRAM.
3. Nunchaku 4-bit (fastest small-card path)
Nunchaku ships a 4-bit Lightning build with 4/8-step inference (needs ComfyUI-nunchaku and ComfyUI ≥ 0.3.60). Documented minimum: 8GB VRAM + 16GB system RAM. Tune num_blocks_on_gpu and pin memory for best throughput. This is the path most 8GB-card owners should use.
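The ≥ 0.3.60 requirement is easy to get wrong if versions are compared as strings, since "0.3.9" sorts after "0.3.60" alphabetically. A small sketch of a numeric comparison, assuming you supply the installed version string yourself (the helper names are ours, not part of ComfyUI or Nunchaku):

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '0.3.60' or 'v0.3.60' into (0, 3, 60) for numeric comparison."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


MIN_COMFYUI = parse_version("0.3.60")


def meets_minimum(installed: str) -> bool:
    """True if the installed version is at least the Nunchaku minimum."""
    return parse_version(installed) >= MIN_COMFYUI


print(meets_minimum("0.3.9"))   # False: 9 is less than 60 numerically
print(meets_minimum("0.3.60"))  # True
```

This only handles plain dotted numbers; a version string with a pre-release suffix would need a fuller parser.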
If ComfyUI itself is new to you, our complete ComfyUI guide covers installation, the node graph, and custom-node managers — all three Qwen-Image workflows build on it. Prefer something simpler than node graphs? See image generation via Ollama-style tools for lighter front-ends.
Download & run (the short version)
1. Install and update ComfyUI (use a recent build; the Nunchaku path needs ≥ 0.3.60). See our ComfyUI guide.
2. Grab the checkpoint that matches your VRAM tier from the official Qwen/Qwen-Image on Hugging Face (or a Comfy-Org / GGUF / Nunchaku repackage for the quantized builds).
3. Place the diffusion model, text encoder, and VAE into their ComfyUI model folders.
4. Load the Qwen-Image template (native, GGUF, or Nunchaku) and, on small cards, add the 4-step Lightning LoRA.
5. Write a prompt that names the exact text you want rendered (e.g. a sign reading “OPEN 24 HOURS”), then queue.
Always read the model card before commercial use. Qwen-Image is Apache 2.0, which is permissive, but checkpoint repackages can carry their own terms.
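For the prompt step, one way to keep the rendered wording exact is to build the prompt from the literal strings you want on the image. This is a sketch of one possible phrasing, not a documented Qwen-Image prompt format; `text_prompt` is a hypothetical helper:

```python
def text_prompt(scene: str, lines: list[str]) -> str:
    """Build a prompt that spells out each line of text in double quotes."""
    if not lines:
        raise ValueError("give at least one line of text to render")
    quoted = ", ".join(f'"{line.strip()}"' for line in lines)
    return f"{scene.strip().rstrip('.')}. The text reads: {quoted}."


print(text_prompt("A neon sign above a diner door at night", ["OPEN 24 HOURS"]))
```

Paste the resulting string into the positive prompt node of whichever workflow you loaded.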
Qwen-Image-Edit and Edit-2509
The same family includes an editing model. Qwen-Image-Edit launched August 18, 2025; Qwen-Image-Edit-2509 (September 22, 2025) is the version most people run today. It adds two things that matter for real work:
- Multi-image editing: feed 1 to 3 input images (person + person, person + product, person + scene) under a single prompt for merges and consistent edits.
- Native ControlNet: built-in depth, edge, and keypoint conditioning for precise pose and structure control, with no separate ControlNet model to juggle.
Edit-2509 has a native ComfyUI workflow, and Nunchaku ships a 4-bit Lightning build of it too, so the same 8GB-VRAM tier applies. Because Qwen renders text so well, the editing model is especially good at rewriting or fixing text inside an existing image — swapping a sign, correcting a misspelled label — which most editors mangle.
Qwen-Image vs FLUX vs SDXL: when to pick which
| Your goal | Best pick | Why |
|---|---|---|
| Posters, signage, logos, UI mockups (real words) | Qwen-Image | Class-leading legible text rendering, EN + CN. |
| Chinese-language text in images | Qwen-Image | Trained for logographic scripts at commercial quality. |
| General aesthetics / prompt adherence | FLUX | Strong all-rounder with a large LoRA ecosystem. |
| Lightest hardware / huge model + LoRA library | SDXL | Smaller, fastest on modest GPUs, biggest community. |
| Multi-image editing + ControlNet | Qwen-Image-Edit-2509 | Native depth/edge/keypoint + 1-3 image inputs. |
The honest summary: Qwen-Image is not a blanket replacement for FLUX or SDXL. It is the model you reach for when the typography has to be correct. For a broader side-by-side of local image models, see our FLUX local guide and the GPU breakdown in best GPU for image generation.
Our test notes (approximate)
We ran the Nunchaku 4-bit Lightning build on a 12GB RTX 3060 with 32GB system RAM. With the 4-step Lightning LoRA, a 1024×1024 generation landed in roughly the 8-15 second range per image once the model was warm — figures are approximate and depend heavily on attention backend, step count, and whether the model stays resident in VRAM between runs. Peak VRAM hovered comfortably under the card's limit, in line with the documented 8GB + 16GB-RAM minimum.
Where it earned its keep was text: prompts asking for a specific multi-word sign came back legible and correctly spelled on the first or second try, which is not our experience with same-tier SDXL setups. Treat these as directional notes, not a benchmark — your numbers will differ with hardware, quant level, and sampler settings.
Who should pick Qwen-Image
| If you are… | Recommendation |
|---|---|
| A designer making posters, ads, or mockups with text | Yes — this is the model's core strength. |
| Working in Chinese (or bilingual EN/CN) layouts | Yes — class-leading CN text rendering. |
| On an 8GB GPU | Yes, via GGUF / Nunchaku 4-bit + Lightning LoRA (16GB RAM). |
| Chasing maximum general aesthetics or speed | Try FLUX or SDXL first. |
| Combining or precisely editing existing images | Use Qwen-Image-Edit-2509 (multi-image + ControlNet). |
Run image models locally, end to end
Qwen-Image, FLUX, SDXL, ControlNet, LoRAs — the Local AI Master course walks you through installing ComfyUI, picking the right VRAM tier, and wiring up real text-rendering and editing workflows on your own hardware, with zero per-image cost.
See the deployment course
Related models & guides
- Qwen2-VL 7B: Qwen's vision-language model for understanding images (not generating them)
- Complete ComfyUI guide: the runtime all three Qwen-Image workflows build on
- FLUX local image generation: the strong general-purpose alternative
- Best GPU for image generation: map the VRAM tiers above to real cards
- Ollama image generation models: lighter front-ends for local image work
Written by the Local AI Master Team
The team behind Local AI Master
We build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.