Wan 2.2 Local Video Generation Guide (2026): Best Open Video AI for 24GB GPUs
Wan 2.2 is Alibaba's open-source video model that makes local video generation practical on consumer GPUs. Released July 28, 2025 under the permissive Apache 2.0 license, it was the first video diffusion model to ship a Mixture-of-Experts (MoE) architecture. The dense TI2V-5B variant generates a 5-second 720p clip at 24fps on a single 24 GB GPU (RTX 4090) in under 9 minutes, while the heavier MoE A14B models (27B total weights, 14B active per step) push quality higher for users with more VRAM. For ComfyUI users with an RTX 4090, 7900 XTX, or Mac Studio, Wan 2.2 is the newest Wan release with open downloadable weights and a strong default among open-source video generators in 2026.
This guide covers everything: the real model variants (dense TI2V-5B, MoE T2V-A14B and I2V-A14B, plus the newer S2V-14B and Animate-14B), ComfyUI installation (native nodes and Kijai's WanVideoWrapper), GGUF quantization, prompting techniques, image-to-video workflows, video-to-video for style transfer, last-frame chaining for longer clips, and benchmarks vs HunyuanVideo and Mochi.
Version note (mid-2026): As of this update, Wan 2.2 is the latest version with open, downloadable weights on the official Wan-Video GitHub and Wan-AI HuggingFace org. You may see third-party sites advertising "Wan 2.5 / 2.6 / 2.7" — those refer to closed, API-only or hosted commercial offerings (or are SEO pages), not open weights you can run locally. This guide is about what you can actually download and run yourself.
Table of Contents
- What Wan 2.2 Is
- Variants: TI2V-5B, T2V-A14B, I2V-A14B, S2V, Animate
- Hardware Requirements
- Wan 2.2 vs HunyuanVideo vs Mochi
- Installation: ComfyUI + WanVideoWrapper
- Downloading Models
- Your First Text-to-Video
- Image-to-Video Workflow
- Prompt Engineering
- GGUF Quantization for Tight VRAM
- Video-to-Video Style Transfer
- Extending Beyond 5 Seconds
- Frame Interpolation (RIFE / FILM)
- Upscaling 720p → 1440p / 4K
- Performance Benchmarks
- Tuning Recipes
- Licensing
- Troubleshooting
- FAQ
What Wan 2.2 Is {#what-it-is}
Wan 2.2 (Tongyi Wanxiang 2.2) is Alibaba's video diffusion model, released July 28, 2025. Architecture: a Diffusion Transformer (DiT) operating on temporally-encoded latent video and, for the A14B models, the first Mixture-of-Experts (MoE) design in a video diffusion model. Two specialized experts split the denoising process: a high-noise expert sets overall layout in the early steps, then a low-noise expert refines detail in later steps, with the handoff chosen by a signal-to-noise threshold. Each A14B model therefore holds 27B total parameters but activates only ~14B per step, which keeps per-step compute close to that of a 14B dense model, although both experts still have to be stored on disk and loaded or swapped in during a run.
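As a rough illustration of the handoff described above, the sketch below picks an expert from a normalized noise level. The names `pick_expert` and `expert_schedule`, the 0.875 boundary, and the evenly spaced timesteps are assumptions made for illustration, not values taken from this guide; real samplers typically use a shifted schedule, and the boundary may differ per model.

```python
def pick_expert(t: float, boundary: float = 0.875) -> str:
    """Return which expert handles a step at normalized noise level t.

    t = 1.0 means pure noise, t = 0.0 means a clean latent.
    The boundary value is an illustrative assumption.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return "high_noise" if t >= boundary else "low_noise"


def expert_schedule(num_steps: int, boundary: float = 0.875) -> list:
    """List the expert used at each step for evenly spaced noise levels.

    Evenly spaced levels are a simplification; only one expert runs per step,
    which is why ~14B of the 27B parameters are active at any time.
    """
    return [pick_expert(1.0 - i / num_steps, boundary) for i in range(num_steps)]


if __name__ == "__main__":
    print(expert_schedule(8))
```

With these assumed values, the first two of eight steps go to the high-noise expert and the remaining six to the low-noise expert.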
The dense TI2V-5B uses a high-compression Wan2.2 VAE (16×16×4 latent compression) so a single 5B model can do both text-to-video and image-to-video at 720p/24fps on a 24 GB GPU.
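To make the 16×16×4 compression concrete, here is a small sketch of the latent grid size for a given clip. The helper name `latent_shape` is hypothetical, and the `4n + 1` frame convention (the first frame encoded on its own, then one latent frame per 4 video frames) is an assumption about how the temporal axis is handled; check the model card before relying on it.

```python
def latent_shape(width: int, height: int, frames: int,
                 spatial: int = 16, temporal: int = 4) -> tuple:
    """Return (latent_frames, latent_height, latent_width).

    Assumes width and height are multiples of the spatial factor and that
    the frame count has the form temporal * n + 1.
    """
    if width % spatial or height % spatial:
        raise ValueError("width and height must be multiples of the spatial factor")
    if (frames - 1) % temporal:
        raise ValueError("frames must be of the form temporal * n + 1")
    return ((frames - 1) // temporal + 1, height // spatial, width // spatial)


if __name__ == "__main__":
    # 1280x704 at 24 fps for about 5 seconds (121 frames under the 4n + 1 assumption)
    print(latent_shape(1280, 704, 121))
```

Under these assumptions a 1280×704, 121-frame clip becomes a 31×44×80 latent grid, which is why a 5B model can denoise 720p video within a 24 GB budget.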
Capabilities:
- Text-to-Video (T2V) — prompt → ~5-second clip (A14B or TI2V-5B)
- Image-to-Video (I2V) — first frame + prompt → motion clip (A14B or TI2V-5B)
- Speech/Audio-to-Video (S2V-14B) — drive a character clip from audio
- Character animation (Animate-14B) — animate / reanimate a subject
- Video-to-Video (V2V) — input clip + prompt → restyled clip (community workflows)
Project: github.com/Wan-Video/Wan2.2. Model weights on the Wan-AI HuggingFace org.
Variants: TI2V-5B, T2V-A14B, I2V-A14B, S2V, Animate {#variants}
| Variant | Params | Type | Best for |
|---|---|---|---|
| Wan 2.2 TI2V-5B | 5B dense | Unified T2V + I2V | The consumer pick — runs on 24 GB |
| Wan 2.2 T2V-A14B | 27B total / 14B active | MoE text-to-video | Highest-quality T2V |
| Wan 2.2 I2V-A14B | 27B total / 14B active | MoE image-to-video | Highest-quality I2V (most popular) |
| Wan 2.2 S2V-14B | 14B | Speech/audio-to-video | Talking / audio-driven character clips |
| Wan 2.2 Animate-14B | 14B | Character animation | Animate or reanimate a subject |
The two A14B models are MoE: 27B weights on disk, but only ~14B activate per denoising step. On a 24 GB card you run them with GGUF/FP8 quantization plus block offloading. The dense TI2V-5B is the one most local users start with — it fits 24 GB comfortably and covers both text-to-video and image-to-video. (There is no 1.3B Wan 2.2 model — the 1.3B variant existed in Wan 2.1.)
Hardware Requirements {#requirements}
| GPU | Workflow |
|---|---|
| RTX 3060 / 4060 8-12 GB | TI2V-5B at Q5 GGUF (tight, reduced quality) |
| RTX 4070 Ti Super / 4080 16 GB | TI2V-5B (GGUF); A14B Q4 with heavy offload |
| RTX 4090 / 7900 XTX 24 GB | TI2V-5B comfortably; A14B with GGUF/FP8 + offload |
| RTX 5090 32 GB / Pro W7900 48 GB | A14B at higher precision |
| Mac Studio M-series 64-128 GB | TI2V-5B via MPS (slow but works) |
Alibaba's reference: TI2V-5B needs at least 24 GB VRAM for single-GPU inference with the official scripts; the smaller-card rows above rely on GGUF quantization and offloading. System RAM: 32 GB+ recommended. Disk: 50-100 GB for the model collection.
Wan 2.2 vs HunyuanVideo vs Mochi {#comparison}
| Property | Wan 2.2 TI2V-5B | Wan 2.2 A14B (MoE) | HunyuanVideo 13B | Mochi 10B |
|---|---|---|---|---|
| Quality | Strong for size | Excellent | Best | Good |
| Fit on a 24 GB GPU | Fits ✅ | GGUF/FP8 + offload | Q4, tight | Fits natively (~16 GB) ✅ |
| Render time (5 s, 720p) | under 9 min | longer (27B weights) | 12-20 min | 5-8 min |
| Motion quality | Good | Excellent | Best | Cleanest |
| Long-clip coherence | Limited | Good | Best | Limited |
| ComfyUI support | Full (native + Kijai) | Full (Kijai handles MoE) | Full | Full |
| License | Apache 2.0 | Apache 2.0 | Tencent Hunyuan Community License (custom terms) | Apache 2.0 |
For most consumer GPU users in 2026: Wan 2.2 TI2V-5B is the easy default. For maximum Wan quality with more VRAM: the A14B models (run through Kijai's wrapper, which routes the high/low-noise experts efficiently). For cinematic output with 32+ GB / multi-GPU: HunyuanVideo.
Installation: ComfyUI + WanVideoWrapper {#installation}
Prerequisite: working ComfyUI install. See ComfyUI Complete Guide.
cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper
cd ComfyUI-WanVideoWrapper
pip install -r requirements.txt
Restart ComfyUI. The Wan-specific nodes appear under the WanVideo category.
For GGUF support: also install ComfyUI-GGUF (city96):
cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF
Downloading Models {#downloading}
The simplest route is the official ComfyUI repackaged repo (Comfy-Org), which ships the diffusion model, text encoder, and VAE already split for ComfyUI. Start with the TI2V-5B — it is the lightest and covers both text-to-video and image-to-video.
Wan 2.2 TI2V-5B (recommended start)
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
split_files/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors \
--local-dir ComfyUI/models/diffusion_models
Required text encoder (UMT5-XXL)
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors \
--local-dir ComfyUI/models/text_encoders
VAE (Wan 2.2 high-compression VAE)
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
split_files/vae/wan2.2_vae.safetensors \
--local-dir ComfyUI/models/vae
The TI2V-5B uses the new high-compression Wan2.2 VAE (16×16×4). The MoE A14B models use the earlier Wan VAE — check the model card for the exact VAE each variant expects rather than assuming one works everywhere. For tight VRAM, GGUF quantizations of these models are published by community quantizers (e.g. city96) and load through ComfyUI-GGUF; pick the exact filename from the quant repo you choose.
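One detail worth knowing: `--local-dir` keeps the repository's relative path, so the commands above place each file under a nested `split_files/...` folder inside the target directory. ComfyUI generally still finds files in subfolders, but if you prefer flat model folders, a minimal Python sketch using `huggingface_hub` could look like this. The helper names `flat_target` and `fetch_all` and the default `ComfyUI/models` root are assumptions; copying from the cache also uses extra disk space.

```python
import shutil
from pathlib import Path

REPO_ID = "Comfy-Org/Wan_2.2_ComfyUI_repackaged"

# Repo path -> ComfyUI models subfolder (same three files as the commands above).
FILES = {
    "split_files/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors": "diffusion_models",
    "split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors": "text_encoders",
    "split_files/vae/wan2.2_vae.safetensors": "vae",
}


def flat_target(models_root, repo_path, subfolder):
    """Destination path with the repo's nested folders stripped off."""
    return Path(models_root) / subfolder / Path(repo_path).name


def fetch_all(models_root="ComfyUI/models"):
    """Download each file to the Hugging Face cache, then copy it into place."""
    # Imported here so the helpers above work without the package installed.
    from huggingface_hub import hf_hub_download

    for repo_path, subfolder in FILES.items():
        cached = hf_hub_download(repo_id=REPO_ID, filename=repo_path)
        target = flat_target(models_root, repo_path, subfolder)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(cached, target)
        print(f"{repo_path} -> {target}")
```

Calling `fetch_all()` from the directory that contains `ComfyUI/` would place `wan2.2_vae.safetensors` directly in `ComfyUI/models/vae/`, and likewise for the other two files.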
Your First Text-to-Video {#first-t2v}
Use the official Wan 2.2 native workflow (ComfyUI → Templates → Video → Wan 2.2), or load the example workflow from ComfyUI-WanVideoWrapper/example_workflows/. A native TI2V-5B text-to-video graph looks like:
[Load Diffusion Model] → MODEL (wan2.2_ti2v_5B_fp16.safetensors)
[Load CLIP] → CLIP (umt5_xxl_fp8_e4m3fn_scaled)
[Load VAE] → VAE (wan2.2_vae)
[CLIP Text Encode] → CONDITIONING positive
[CLIP Text Encode] → CONDITIONING negative
[EmptyLatentVideo / Wan empty latent] → LATENT (set 1280x704, ~24fps, ~5 sec)
[KSampler] (~30 steps)
[VAE Decode (Tiled)] → IMAGES
[VHS_VideoCombine] → MP4 file
Click Queue Prompt. On an RTX 4090, the TI2V-5B produces a 5-second 720p clip in under 9 minutes (Alibaba's reference figure). The MoE A14B graphs add a second model loader, so that the high-noise and low-noise experts are loaded separately; Kijai's WanVideoWrapper wires that routing for you.
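If you would rather queue renders from a script than click Queue Prompt, ComfyUI also accepts workflows over HTTP. The sketch below assumes a local server on the default `127.0.0.1:8188` and a workflow exported in API format; the file name `wan22_t2v_api.json` and the helper names are hypothetical.

```python
import json
import urllib.request
import uuid


def build_prompt_request(workflow: dict, client_id: str = "") -> bytes:
    """Encode an API-format workflow as the JSON body for POST /prompt."""
    payload = {"prompt": workflow, "client_id": client_id or str(uuid.uuid4())}
    return json.dumps(payload).encode("utf-8")


def queue_workflow(workflow_path: str, server: str = "http://127.0.0.1:8188") -> dict:
    """Send a saved API-format workflow to a running ComfyUI instance."""
    with open(workflow_path, "r", encoding="utf-8") as fh:
        workflow = json.load(fh)
    request = urllib.request.Request(
        f"{server}/prompt",
        data=build_prompt_request(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Example (requires a running ComfyUI server and an exported workflow file):
# result = queue_workflow("wan22_t2v_api.json")
# print(result)
```

Note that the regular "Save" format of a workflow differs from the API export, so a file saved the usual way is unlikely to be accepted by this endpoint.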
Image-to-Video Workflow {#i2v}
TI2V-5B handles image-to-video in the same unified model — add an image input and use the I2V node from the native template or the wrapper:
[Load Image] → IMAGE (your start frame)
[Wan I2V / WanVideoImageToVideo] → LATENT (combines image + prompt + frame count)
For the dedicated I2V-A14B model, follow its native/wrapper example workflow (the MoE graph loads both the high-noise and low-noise experts). Best practice: start with a high-quality 720p image. Composition and color of the start frame strongly drive the rest of the clip.
Prompt Engineering {#prompts}
Wan responds to:
- Camera moves: dolly, tracking, panning, aerial, drone, handheld
- Lens: 24mm wide, 50mm portrait, 85mm telephoto, anamorphic
- Lighting: golden hour, blue hour, neon, soft natural, harsh midday
- Motion: slow-motion, fast pan, time-lapse, freeze frame
- Atmosphere: misty, foggy, smoky, hazy, dust particles, lens flare
- Style: cinematic, photorealistic, documentary, music video
Example prompt:
Cinematic tracking shot of a lone samurai walking through misty bamboo forest at dawn, soft volumetric god rays, 35mm anamorphic lens, slow-motion, fluttering leaves, atmospheric haze, deep depth of field, color graded teal and orange.
Negative prompt:
blurry, deformed, duplicate frames, jittery motion, watermark, text, low quality, oversaturated, washed out
For I2V, prompt drives motion not composition — the input image handles composition.
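When generating many shots, it can help to assemble prompts from the categories listed above rather than typing them by hand. This is a minimal sketch; the function name `build_prompt` and the ordering of the parts are assumptions, not a format Wan requires.

```python
def build_prompt(subject: str, *, camera: str = "", lens: str = "",
                 lighting: str = "", motion: str = "",
                 atmosphere: str = "", style: str = "") -> str:
    """Compose a single prompt sentence from optional descriptive parts."""
    subject = subject.strip()
    if not subject:
        raise ValueError("subject must not be empty")
    # Style and camera move lead the sentence, e.g. "cinematic tracking shot of ..."
    opening = " ".join(p.strip() for p in (style, camera) if p.strip())
    first = f"{opening} of {subject}" if opening else subject
    parts = [first, lighting, lens, motion, atmosphere]
    text = ", ".join(p.strip() for p in parts if p.strip())
    return text[0].upper() + text[1:] + "."


if __name__ == "__main__":
    print(build_prompt(
        "a lone samurai walking through misty bamboo forest at dawn",
        style="cinematic",
        camera="tracking shot",
        lighting="soft volumetric god rays",
        lens="35mm anamorphic lens",
        motion="slow-motion",
        atmosphere="atmospheric haze",
    ))
```

For I2V, the same helper can be called with only `camera` and `motion` filled in, since the start frame already fixes the composition.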
GGUF Quantization for Tight VRAM {#gguf}
GGUF lets you trade a little quality for a lot of VRAM headroom. The exact footprint depends on which model you quantize — the dense TI2V-5B is small enough that many users run it at FP16/FP8, while the A14B MoE models (27B weights on disk) usually need quantization plus block offloading to fit a 24 GB card.
General quant trade-off (relative, not exact VRAM):
| Quant | Quality | Speed | Notes |
|---|---|---|---|
| FP16 / BF16 | Reference | Slowest | Full precision |
| FP8 | Near-reference | Faster | Common for the A14B on 24 GB |
| Q8_0 | ~99% of reference | Fast | Best GGUF quality |
| Q6_K | ~97% of reference | Faster | Good balance |
| Q5_K_M | ~94% of reference | Faster | Tight 12-16 GB cards |
| Q4_K_S | ~90%, visible drop | Fastest | Last resort on 8-12 GB |
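For 24 GB GPUs: run TI2V-5B at FP16/FP8, or the A14B at FP8/Q8 with offload. For 8-12 GB cards: a Q5 GGUF of TI2V-5B. Pick the exact GGUF filename from the quant repo you download, since sizes vary by quantizer. The speed column mostly reflects avoided offloading: when the higher-precision model already fits in VRAM, GGUF dequantization overhead can make a quant no faster than FP16/FP8.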
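For a back-of-the-envelope check on whether a given quant is likely to fit, weight size is roughly parameters × bits per weight. The bits-per-weight figures below are nominal approximations (real GGUF files mix tensor types), and the estimate covers model weights only, not the text encoder, VAE, or activations; the helper name `weight_gib` is hypothetical.

```python
# Nominal bits per weight; approximate, and they vary by quantizer.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "FP8": 8.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q5_K_M": 5.5,
    "Q4_K_S": 4.5,
}


def weight_gib(params_billion: float, fmt: str) -> float:
    """Approximate weight size in GiB for a model with the given parameter count."""
    bits = BITS_PER_WEIGHT[fmt]
    return params_billion * 1e9 * bits / 8 / 2**30


if __name__ == "__main__":
    for name, params in (("TI2V-5B", 5), ("A14B, one expert", 14), ("A14B, both experts", 27)):
        row = ", ".join(f"{fmt} {weight_gib(params, fmt):.1f}" for fmt in BITS_PER_WEIGHT)
        print(f"{name}: {row} (GiB)")
```

By this estimate the 5B model is about 9.3 GiB at FP16, while both A14B experts together are about 25 GiB even at FP8, which is consistent with the A14B needing offloading on a 24 GB card.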
Video-to-Video Style Transfer {#v2v}
V2V workflow:
[Load Video] → IMAGES (input clip)
[VAE Encode (Tiled)] → LATENT
[KSampler] (denoise 0.4-0.6) → LATENT
[VAE Decode (Tiled)] → IMAGES
Lower denoise (0.3-0.5) preserves motion and composition; higher (0.6-0.8) deviates more.
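To see why lower denoise keeps more of the source clip, it helps to picture the sampler's starting point as a blend of the encoded input and fresh noise. The sketch below is a conceptual simplification: it assumes a linear (rectified-flow style) interpolation and treats the denoise value directly as the noise fraction, whereas a real sampler maps denoise onto its own sigma schedule. The name `noised_start` is hypothetical.

```python
import random


def noised_start(latent: list, denoise: float, seed: int = 0) -> list:
    """Blend latent values with Gaussian noise in proportion to denoise.

    denoise = 0.0 returns the input unchanged; denoise = 1.0 returns pure noise.
    """
    if not 0.0 <= denoise <= 1.0:
        raise ValueError("denoise must be in [0, 1]")
    rng = random.Random(seed)
    return [(1.0 - denoise) * x + denoise * rng.gauss(0.0, 1.0) for x in latent]


if __name__ == "__main__":
    source = [1.0, 2.0, 3.0]
    for d in (0.3, 0.5, 0.8):
        print(d, [round(v, 2) for v in noised_start(source, d)])
```

At 0.3 most of each value still comes from the source latent, so motion and composition survive; at 0.8 the noise dominates and the prompt has far more freedom.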
For consistent style, pair with a style LoRA. Community fine-tunes for anime, oil painting, comic book, and various film looks are on HuggingFace.
Extending Beyond 5 Seconds {#extending}
Two reliable approaches:
- Last-frame-as-first-frame chaining: generate clip A, extract its last frame, feed it as the I2V input for clip B, repeat. Maintains visual continuity but loses long-range coherence after a couple of chains.
- Manual editing in DaVinci Resolve / Premiere: generate several separate ~5-second shots from your storyboard and edit them together with audio.
The second approach is recommended for any narrative content: treat Wan 2.2 as a per-shot tool. Longer single-pass clips are the headline target of the next (not-yet-open) generation, so do not plan around them for local work today.
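The last-frame chaining loop can be sketched as follows. The `generate` callable is a hypothetical stand-in for whatever runs your I2V workflow (for example, a function that queues a ComfyUI job and returns the decoded frames); only the chaining logic is shown.

```python
from typing import Callable, List, Optional, Sequence


def chain_clips(generate: Callable[[str, Optional[object]], Sequence[object]],
                prompts: Sequence[str],
                first_frame: Optional[object] = None) -> List[object]:
    """Generate one clip per prompt, seeding each clip with the previous last frame."""
    all_frames: List[object] = []
    start = first_frame
    for prompt in prompts:
        frames = list(generate(prompt, start))
        if not frames:
            raise ValueError(f"generator returned no frames for prompt: {prompt!r}")
        if all_frames:
            # The new clip begins on the previous clip's last frame,
            # so drop it to avoid a duplicated frame at the seam.
            frames = frames[1:]
        all_frames.extend(frames)
        start = all_frames[-1]
    return all_frames


if __name__ == "__main__":
    # Toy generator: "frames" are integers counting up from the start frame.
    def fake_generate(prompt, start):
        base = 0 if start is None else start
        return [base, base + 1, base + 2]

    print(chain_clips(fake_generate, ["shot A", "shot B"]))
```

Because every clip only sees the single frame before it, errors accumulate across seams, which matches the loss of long-range coherence noted above.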
Frame Interpolation (RIFE / FILM) {#interpolation}
24fps → 60fps for smoother playback:
# Install ComfyUI-Frame-Interpolation
cd ComfyUI/custom_nodes
git clone https://github.com/Fannovel16/ComfyUI-Frame-Interpolation
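In the workflow, add a RIFE VFI node after VAE Decode. Going from 24 to 60 fps is a 2.5× ratio; if the node only accepts a whole-number multiplier, either settle for 2× (48 fps) or interpolate 5× and keep every second frame. Inference cost is minimal (~30 sec for a 5-sec clip on an RTX 4090).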
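The frame-rate arithmetic is easy to get wrong, so here is a small sketch of it. It assumes interpolation inserts new frames between each consecutive pair (so the last frame has nothing after it) and that a non-integer ratio is reached by interpolating to a common multiple and then dropping frames; the helper names are hypothetical, and the node's actual multiplier options should be checked in your install.

```python
from fractions import Fraction


def interpolation_plan(src_fps: int, dst_fps: int) -> tuple:
    """Return (multiplier, keep_every) so that src_fps * multiplier / keep_every == dst_fps."""
    ratio = Fraction(dst_fps, src_fps)
    return ratio.numerator, ratio.denominator


def interpolated_frame_count(frames: int, multiplier: int) -> int:
    """Frame count after inserting (multiplier - 1) frames between each consecutive pair."""
    return (frames - 1) * multiplier + 1


if __name__ == "__main__":
    multiplier, keep_every = interpolation_plan(24, 60)
    dense = interpolated_frame_count(121, multiplier)
    kept = len(range(dense)[::keep_every])
    print(multiplier, keep_every, dense, kept)
```

For a 121-frame clip at 24 fps this suggests a 5× interpolation followed by keeping every second frame, which leaves 301 frames, or about 5 seconds at 60 fps.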
Upscaling 720p → 1440p / 4K {#upscaling}
Use Real-ESRGAN x2 / x4 or Topaz Video AI (commercial) on the rendered MP4.
In ComfyUI:
[Upscale Image (using Model)] (Real-ESRGAN x2)
Apply per-frame after VAE decode. Combined render+interp+upscale on RTX 4090: ~10-15 min for a 5-sec 1440p 60fps clip.
Performance Benchmarks {#benchmarks}
5-second 720p clip, ~30 steps, RTX 4090:
| Variant | Time |
|---|---|
| Wan 2.2 TI2V-5B (Alibaba reference) | under 9 min |
| Wan 2.2 A14B (MoE, FP8/Q8 + offload) | longer — depends on offload setup |
| HunyuanVideo Q4 (comparison) | 12-20 min |
| Mochi (comparison) | 5-8 min |
The TI2V-5B "under 9 minutes" figure is Alibaba's published single-GPU reference; the A14B models are heavier and their time depends heavily on your quant + offloading configuration. The HunyuanVideo and Mochi rows are internal comparison runs. For 7900 XTX: expect noticeably slower than RTX 4090. For Apple Silicon via MPS: the 5B model runs but several times slower than a 4090.
Tuning Recipes {#tuning}
RTX 4090 / 5090 (best quality)
A14B at FP8/Q8 with block offload (run through Kijai's WanVideoWrapper for efficient MoE routing) + ~30 steps. Or TI2V-5B at FP16 for the fast path.
RTX 4070 Ti Super / 4080 (16 GB)
TI2V-5B (FP8 or Q-quant) + offload the UMT5 text encoder to CPU.
RTX 3060 / 4060 (8-12 GB)
Use a Q5 GGUF of TI2V-5B — slower and lower quality but viable. (There is no Wan 2.2 1.3B model; that was Wan 2.1.)
Apple Silicon
MPS-compatible PyTorch + TI2V-5B on an M-series chip with enough unified memory. Expect several times slower than a 4090.
Licensing {#licensing}
Wan 2.2 ships under the Apache 2.0 license — free for commercial use and redistribution, with no copyleft and no restriction on training other models. The model card adds an acceptable-use note against harmful or deceptive applications. Always confirm the current license on the Wan-AI HuggingFace model card before deploying commercially, since terms can differ between variants and future releases.
For other permissively-licensed video models: OpenSora and CogVideoX.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| OOM at VAE decode | Tiled VAE not enabled | Use VAE Decode (Tiled) node |
| Workflow won't load | Missing custom nodes | Install via Manager → Install Missing |
| Black output | NaN VAE | Launch ComfyUI with --fp32-vae or load an fp32 VAE |
| Jittery motion | Too few steps | Increase to 35-40 steps |
| Repeating frames | Wrong scheduler | Use sgm_uniform with dpmpp_2m |
| First/last frame mismatch | I2V CLIP Vision off | Connect CLIP Vision Encode |
| Slow on 7900 XTX | FlashAttention not built | Build FA-2 ROCm fork |
FAQ {#faq}
See answers to common Wan 2.2 questions below.
Sources: Wan-Video GitHub | ComfyUI-WanVideoWrapper | city96 GGUF quants | Internal benchmarks RTX 4090, RX 7900 XTX, M4 Max.