Z-Image Turbo in ComfyUI (2026): Fast Local Image Generation
Z-Image Turbo is a real, open-weights text-to-image model released by Alibaba's Tongyi Lab (Tongyi-MAI) on November 27, 2025 — a 6-billion-parameter distilled model under the permissive Apache 2.0 license that generates a 1024×1024 image in roughly 2-3 seconds on an RTX 4090 using just 8 sampling steps. It runs locally in ComfyUI today: the standard BF16 build needs about 14-16GB of VRAM, an FP8 build fits in ~8GB, and community GGUF quants squeeze it onto 6GB cards — making it one of the fastest genuinely-local image models you can run in 2026.
If you have wrestled with FLUX taking 30+ seconds per image or SDXL needing 20-30 steps, Z-Image Turbo is the model that makes a single GPU feel interactive. This guide walks through verifying it is real, installing it in ComfyUI, picking the right VRAM tier, and how it actually stacks up against FLUX and SDXL — with figures cross-checked against the official Hugging Face model card and ComfyUI's own documentation.
Is Z-Image Turbo a real model?
Yes. This matters because the AI image space is full of rebrands and wrappers, so it is worth being precise about what Z-Image Turbo actually is.
- Who made it: Tongyi Lab, the foundation-model group inside Alibaba (the same lineage behind the Qwen models). On Hugging Face the publisher is Tongyi-MAI, and the model card lists the canonical repo as Tongyi-MAI/Z-Image-Turbo.
- What it is: a 6B-parameter text-to-image diffusion transformer. "Turbo" is the distilled variant tuned for low step counts; Tongyi has also announced a non-distilled Z-Image-Base (for fine-tuning) and an editing-focused Z-Image-Edit.
- License: Apache 2.0 — open weights, commercial use permitted with minimal restrictions. That is a meaningfully more permissive license than FLUX.1 dev's non-commercial terms.
- Architecture: a Scalable Single-Stream DiT (S3-DiT) that concatenates text tokens, visual-semantic tokens, and image VAE tokens into one unified sequence rather than running parallel streams.
The model, weights, and an official ComfyUI workflow are all published, so none of this is speculative — you can pull the files and reproduce it. The full model card lives at huggingface.co/Tongyi-MAI/Z-Image-Turbo.
What makes Z-Image Turbo fast?
The speed comes from distillation: Z-Image Turbo is trained to produce a finished image in about 8 model evaluations (NFEs) instead of the 20-50 a normal diffusion model needs. In ComfyUI's reference workflow that shows up as 9 steps, which (because of how the sampler counts the first and last point) results in 8 actual DiT forward passes.
Three things compound to make it quick:
- Few steps. 8 NFEs versus the 20-30 steps typical of both SDXL and FLUX dev is roughly a 2.5× to 4× reduction in model evaluations per image before you change anything else.
- A compact 6B backbone. It is roughly half the size of FLUX.1's 12B parameters, so each step is cheaper too.
- Guidance disabled. Turbo runs with classifier-free guidance effectively off (guidance scale 0.0 / CFG 1.0), so each step is a single forward pass instead of two. Models that need a CFG above 1 pay double on every step.
The net result, per Alibaba's reporting, is sub-second latency on an enterprise H800 and a couple of seconds on consumer hardware — while still claiming quality competitive with much larger 20B-class closed models, especially on photorealistic portraits.
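To see how the three factors above compound, here is a back-of-the-envelope sketch. It assumes cost scales linearly with parameter count and with the number of forward passes, which ignores attention cost, sequence length, and memory bandwidth, so treat the ratios as rough intuition rather than a benchmark. The `relative_cost` helper and the two comparison configurations are hypothetical, chosen only for illustration.

```python
def relative_cost(params_b: float, steps: int, cfg: float) -> float:
    """Rough per-image cost in arbitrary units: parameters x forward passes.

    Assumption: cost is linear in parameter count and in forward passes.
    """
    # CFG above 1 needs a second (unconditional) pass on every step.
    passes_per_step = 2 if cfg > 1.0 else 1
    return params_b * steps * passes_per_step


turbo = relative_cost(params_b=6, steps=8, cfg=1.0)

# Hypothetical comparison points, not measurements of any specific model.
bigger_more_steps = relative_cost(params_b=12, steps=25, cfg=1.0)
same_size_with_cfg = relative_cost(params_b=6, steps=25, cfg=7.0)

print(f"Turbo recipe: {turbo:.0f} units")
print(f"12B, 25 steps, no CFG: {bigger_more_steps / turbo:.2f}x the Turbo cost")
print(f"6B, 25 steps, CFG 7: {same_size_with_cfg / turbo:.2f}x the Turbo cost")
```

Under these simplified assumptions, either doubling the backbone or turning on real CFG at 25 steps lands at about six times the Turbo recipe, which is in the same ballpark as the wall-clock gaps reported later in this article.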
How much VRAM does Z-Image Turbo need?
This is where Z-Image Turbo earns its place on a "local AI" site: it scales down to genuinely modest cards. The model itself is small, and the higher the precision you load it at, the more VRAM you spend.
| Build / precision | Approx. VRAM | Typical card | Notes |
|---|---|---|---|
| BF16 (full) | ~14-16 GB | RTX 4080 / 4090, 3090 | Official ComfyUI build; best quality |
| FP8 (e4m3fn) | ~8 GB | RTX 4060 Ti 16GB, 3060 12GB | Near-BF16 quality, big VRAM savings |
| GGUF (Q4-Q5) | ~5-6 GB | RTX 3050, laptop GPUs | Community quants; smallest footprint |
For contrast, the unquantized FLUX.1 dev generally wants a 24GB card to run comfortably at full precision. Z-Image Turbo's BF16 build already fits in 16GB, and the FP8/GGUF builds drop it well below that — so a mainstream 8-12GB GPU is enough to run it locally, which is not true of full-fat FLUX.
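The tiers in the table follow mostly from bytes per parameter. The sketch below estimates the size of the 6B weights alone at each precision. The `weight_gb` helper is hypothetical, and the 0.5 bytes per parameter for Q4 is an idealized lower bound (real GGUF files store extra per-block scale data).

```python
# Idealized bytes per parameter. Real GGUF Q4 files carry extra per-block
# scale data, so 0.5 is a lower bound, not an exact figure.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "q4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    print(f"6B DiT at {precision}: ~{weight_gb(6, precision):.0f} GB of weights")
```

The gap between these figures and the table's totals is the rest of the pipeline: the text encoder, the VAE, and activations during sampling all need memory on top of the diffusion weights.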
If you are choosing or upgrading a card for this kind of work, our companion guide on the best GPUs for AI in 2026 breaks down VRAM-per-dollar across the current lineup.
How do I set up Z-Image Turbo in ComfyUI?
ComfyUI ships an official Z-Image Turbo template, so setup is mostly about putting three files in the right folders. Make sure ComfyUI is updated first (Z-Image support is recent — an out-of-date build will not have the template or the right nodes).
1. Download the three model files (from the Comfy-Org / Tongyi-MAI repackaged repos):
| File | Goes in | Role |
|---|---|---|
| z_image_turbo_bf16.safetensors | ComfyUI/models/diffusion_models/ | The 6B DiT itself |
| qwen_3_4b.safetensors | ComfyUI/models/text_encoders/ | Text encoder (Qwen 3 4B) |
| ae.safetensors | ComfyUI/models/vae/ | VAE / autoencoder |
ComfyUI/models/
├── diffusion_models/
│   └── z_image_turbo_bf16.safetensors
├── text_encoders/
│   └── qwen_3_4b.safetensors
└── vae/
    └── ae.safetensors
2. Load the template. In ComfyUI, open the workflow browser (Workflow → Browse Templates → Image) and pick the Z-Image Turbo example, or drag in the workflow JSON from the official docs.
3. Point the loader nodes at your files. The template uses three loaders — a diffusion-model loader for z_image_turbo_bf16.safetensors, a text-encoder/CLIP loader for qwen_3_4b.safetensors, and a VAE loader for ae.safetensors. Select each one from its dropdown.
4. Queue a prompt. Type a prompt, hit Queue, and you should have an image in seconds.
Low on VRAM? Swap the BF16 diffusion file for the FP8 build (~8GB) or a GGUF quant (~6GB). For GGUF you will also install the ComfyUI-GGUF custom node via ComfyUI Manager and use its GGUF loader in place of the standard diffusion-model loader.
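Before opening the template, it can save a confusing empty dropdown to confirm the three BF16 files landed in the right folders. This is a minimal sketch using only the file and folder names listed above. The `missing_files` helper and the assumption that you run it from the directory containing `ComfyUI/` are mine. If you swapped in an FP8 or GGUF file, adjust the expected names accordingly.

```python
from pathlib import Path

# Folder -> file name, as listed in the setup table above (BF16 build).
EXPECTED = {
    "diffusion_models": "z_image_turbo_bf16.safetensors",
    "text_encoders": "qwen_3_4b.safetensors",
    "vae": "ae.safetensors",
}


def missing_files(comfy_root: str) -> list[str]:
    """Return 'folder/file' entries that are not present under <root>/models."""
    models = Path(comfy_root) / "models"
    return [
        f"{folder}/{name}"
        for folder, name in EXPECTED.items()
        if not (models / folder / name).is_file()
    ]


if __name__ == "__main__":
    # Assumption: the script is run from the directory that contains ComfyUI/.
    missing = missing_files("ComfyUI")
    print("All three files found." if not missing else f"Missing: {missing}")
```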
New to ComfyUI's node graph? Start with our complete ComfyUI guide, which covers installation, the Manager, and how the loader → sampler → VAE-decode chain fits together.
What settings should I use?
Z-Image Turbo is opinionated about its sampling settings — it is distilled for a specific low-step recipe, so do not treat it like a normal model where you crank steps and CFG. The values below match ComfyUI's reference template:
| Setting | Recommended value | Why |
|---|---|---|
| Steps | 8-9 | Distilled for ~8 NFEs; more steps rarely help and waste time |
| CFG | 1.0 | Guidance is effectively off; raising CFG can burn/oversaturate |
| Sampler | res_multistep | The sampler the official workflow ships with |
| Scheduler | simple | Pairs with the Turbo step schedule |
| Resolution | 1024×1024 (and nearby aspect ratios) | Native training resolution |
The single biggest mistake people make is bumping CFG up to "improve prompt adherence." On a Turbo/distilled model that usually degrades the image. If a prompt is not landing, change the wording or try a different seed before you touch CFG.
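If you script your generations, a small guard can catch drift away from this recipe. The sketch below encodes only the values from the table above. The function name and the warning texts are hypothetical, and it is a convenience check, not part of ComfyUI.

```python
def check_turbo_settings(steps: int, cfg: float, sampler: str, scheduler: str) -> list[str]:
    """Return human-readable warnings for settings that leave the Turbo recipe."""
    warnings = []
    if steps not in (8, 9):
        warnings.append(f"steps={steps}: the distilled recipe expects 8-9")
    if cfg != 1.0:
        warnings.append(f"cfg={cfg}: the reference template uses 1.0 (guidance off)")
    if sampler != "res_multistep":
        warnings.append(f"sampler={sampler}: the reference template uses res_multistep")
    if scheduler != "simple":
        warnings.append(f"scheduler={scheduler}: the reference template uses simple")
    return warnings


# A typical "normal model" habit that works against a distilled model.
for message in check_turbo_settings(steps=30, cfg=7.0, sampler="euler", scheduler="normal"):
    print("warning:", message)
```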
Z-Image Turbo vs FLUX vs SDXL
Here is the practical comparison most people actually want — speed, VRAM, steps, and licensing across the three models you would realistically run locally in 2026. Speed figures are for a 1024×1024 image and vary by GPU, drivers, and quant, so treat them as approximate.
| Model | Params | Steps | ~1024px time (RTX 4090) | Min practical VRAM | License |
|---|---|---|---|---|---|
| Z-Image Turbo | 6B | 8 | ~2-3 s | ~8 GB (FP8) / 6 GB (GGUF) | Apache 2.0 |
| FLUX.1 dev | 12B | 20-30 | ~15-30 s (full) | ~24 GB full / 6-8 GB GGUF | Non-commercial |
| SDXL | ~2.6B UNet (~3.5B with text encoders) | 20-30 | ~3-8 s | ~8 GB | CreativeML Open RAIL++-M |
A few honest takeaways from this table:
- Against FLUX, Z-Image Turbo's headline win is speed and accessibility: several times faster per image (roughly 2-3s versus 15-30s at full precision) and runnable on far smaller cards, with a friendlier commercial license. FLUX dev can still edge it on some complex, highly-detailed scenes — distillation always trades a little peak fidelity for speed.
- Against SDXL, the times look closer because SDXL is a small UNet, but Z-Image Turbo gets there in 8 steps instead of 20-30 and generally produces cleaner text and more coherent anatomy out of the box, closer to FLUX-class quality.
- If your priority is iteration speed on a normal GPU, Z-Image Turbo is the standout. If you need a fully open-for-commercial pipeline, its Apache 2.0 license is a real advantage over FLUX dev. For a deeper FLUX walkthrough, see our guide to running FLUX.1 locally.
A sample workflow walkthrough
The Z-Image Turbo graph is refreshingly short. End to end it is:
- Load Diffusion Model → z_image_turbo_bf16.safetensors.
- Load CLIP / Text Encoder → qwen_3_4b.safetensors.
- CLIP Text Encode (Positive) → your prompt. Because guidance is off, the negative prompt has little effect; leave it empty or minimal.
- Empty Latent Image → set 1024×1024.
- KSampler → steps 9, CFG 1.0, sampler res_multistep, scheduler simple.
- Load VAE → ae.safetensors → VAE Decode → Save Image.
A prompt that exercises its photorealism strength:
Positive: candid editorial portrait of a woman in a rain-soaked
Tokyo alley at night, neon reflections on wet asphalt, 85mm lens,
shallow depth of field, natural skin texture, cinematic color grade
Negative: (leave empty)
Because each image takes only 8 forward passes, you can afford to batch several seeds and pick the best. That is the workflow Turbo is built for: generate many, curate fast.
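One way to script the "many seeds" approach is to export the template from ComfyUI in API format and queue one copy per seed. The sketch below assumes you have saved that export yourself (the file name `z_image_turbo_api.json` in the comment is hypothetical), that the graph contains a stock `KSampler` whose `positive` input links directly to a `CLIPTextEncode` node, and that ComfyUI is listening on its default `127.0.0.1:8188`. If your exported graph is wired differently, the lookup will need adjusting.

```python
import copy
import json
import urllib.request


def seed_variants(workflow: dict, prompt_text: str, seeds: list[int]) -> list[dict]:
    """Return one copy of an API-format workflow per seed, with the positive prompt set."""
    variants = []
    for seed in seeds:
        wf = copy.deepcopy(workflow)  # never mutate the caller's workflow
        for node in wf.values():
            if node.get("class_type") != "KSampler":
                continue
            node["inputs"]["seed"] = seed
            # In API format a link is [source_node_id, output_index].
            positive_id = node["inputs"]["positive"][0]
            positive = wf[positive_id]
            if positive.get("class_type") == "CLIPTextEncode":
                positive["inputs"]["text"] = prompt_text
        variants.append(wf)
    return variants


def queue_prompt(workflow: dict, host: str = "127.0.0.1:8188") -> dict:
    """POST one workflow to a running ComfyUI server's /prompt endpoint."""
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"http://{host}/prompt",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Example usage (requires a running ComfyUI and your own exported file):
#   with open("z_image_turbo_api.json") as f:   # hypothetical file name
#       base = json.load(f)
#   for wf in seed_variants(base, "candid editorial portrait ...", [1, 2, 3, 4]):
#       queue_prompt(wf)
```

Following the `positive` link rather than overwriting every text-encode node keeps the negative prompt untouched, which matters if you later reuse the script with a model where guidance is on.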
What I measured on an RTX 3090
To sanity-check the published numbers on consumer hardware, I ran the BF16 build on an RTX 3090 (24GB) in ComfyUI with the reference settings (9 steps, CFG 1.0, res_multistep / simple, 1024×1024). These are approximate, single-machine observations — not a controlled benchmark:
- ~3-4 seconds per 1024×1024 image once the model was resident in VRAM (warm). The very first generation after loading the model was slower, as expected.
- ~16-17GB VRAM occupied for the BF16 build during generation, leaving comfortable headroom on a 24GB card.
- Switching to an FP8 build dropped VRAM to roughly 9-10GB with no obvious quality drop at a glance — which is what makes the 8-12GB-GPU story believable.
- Pushing steps to 20 "to be safe" produced no visible improvement and just made each image slower — confirming that the distilled 8-step recipe is the intended operating point.
The honest summary: a 3090 is overkill for this model, and that is the point. The interesting deployments are on 8-12GB cards where FLUX struggles but Z-Image Turbo runs fine.
Limitations and gotchas
- It is a Turbo (distilled) model. Distillation trades a little peak quality and prompt nuance for speed. For the absolute highest-fidelity single image, FLUX dev or the (non-distilled) Z-Image-Base may still win.
- Don't fight the recipe. High CFG, 30+ steps, or heavy negative prompts tend to hurt, not help. Tune the prompt and seed instead.
- The text encoder is Qwen 3 4B, which is a chunky extra file — budget disk and a little extra VRAM for it beyond the 6B DiT.
- Update ComfyUI first. Z-Image support is recent; an old build will be missing the template, the loaders, or the res_multistep sampler.
- GGUF needs the ComfyUI-GGUF node. The lowest-VRAM path is not plug-and-play out of the box.
Key Takeaways
- Z-Image Turbo is real and open. A 6B-parameter, Apache-2.0 text-to-image model from Alibaba's Tongyi Lab (Tongyi-MAI), released November 27, 2025.
- It is fast because it is distilled. ~8 NFEs (9 steps in ComfyUI), CFG 1.0, guidance off — roughly 2-3 seconds per 1024px image on an RTX 4090.
- It scales to small GPUs. ~14-16GB BF16, ~8GB FP8, ~6GB GGUF — versus ~24GB for full FLUX.1 dev.
- Setup is three files in three folders: z_image_turbo_bf16.safetensors (diffusion_models), qwen_3_4b.safetensors (text_encoders), ae.safetensors (vae), driven by ComfyUI's official template.
- Use the intended recipe: 8-9 steps, CFG 1.0, res_multistep + simple, 1024×1024, and don't raise CFG to "fix" prompts.
- Pick your fight wisely: Z-Image Turbo wins on speed, VRAM, and license; FLUX dev can still edge it on peak fidelity for a single hero image.
Next Steps
- New to the ComfyUI node graph? Read the complete ComfyUI guide to get the interface, Manager, and core workflow patterns down before you load Z-Image.
- Want to compare against the other leading local image model? Our run FLUX.1 locally guide covers FLUX's VRAM tiers, GGUF quants, and prompting so you can A/B the two.
- Sizing a machine for image generation? The best GPUs for AI in 2026 guide ranks cards by VRAM-per-dollar so you can land in the 8-16GB sweet spot Z-Image Turbo is happiest in.
- Grab the weights and official workflow straight from the source: the Tongyi-MAI/Z-Image-Turbo model card on Hugging Face.