
Wan 2.2 Local Video Generation Guide (2026): Best Open Video AI for 24GB GPUs

May 1, 2026
22 min read
LocalAimaster Research Team


Wan 2.2 is Alibaba's open-source video model that makes local video generation practical on consumer GPUs. Released July 28, 2025 under the permissive Apache 2.0 license, it was the first video diffusion model to ship a Mixture-of-Experts (MoE) architecture. The dense TI2V-5B variant generates a 5-second 720p clip at 24fps on a single 24 GB GPU (RTX 4090) in under 9 minutes, while the heavier MoE A14B models (27B total weights, 14B active per step) push quality higher for users with more VRAM. For ComfyUI users with an RTX 4090, 7900 XTX, or Mac Studio, Wan 2.2 is the newest Wan release with open downloadable weights and a strong default among open-source video generators in 2026.

This guide covers everything: the real model variants (dense TI2V-5B, MoE T2V-A14B and I2V-A14B, plus the newer S2V-14B and Animate-14B), ComfyUI installation (native nodes and Kijai's WanVideoWrapper), GGUF quantization, prompting techniques, image-to-video workflows, video-to-video for style transfer, last-frame chaining for longer clips, and benchmarks vs HunyuanVideo and Mochi.

Version note (mid-2026): As of this update, Wan 2.2 is the latest version with open, downloadable weights on the official Wan-Video GitHub and Wan-AI HuggingFace org. You may see third-party sites advertising "Wan 2.5 / 2.6 / 2.7" — those refer to closed, API-only or hosted commercial offerings (or are SEO pages), not open weights you can run locally. This guide is about what you can actually download and run yourself.

Table of Contents

  1. What Wan 2.2 Is
  2. Variants: TI2V-5B, T2V-A14B, I2V-A14B, S2V, Animate
  3. Hardware Requirements
  4. Wan 2.2 vs HunyuanVideo vs Mochi
  5. Installation: ComfyUI + WanVideoWrapper
  6. Downloading Models
  7. Your First Text-to-Video
  8. Image-to-Video Workflow
  9. Prompt Engineering
  10. GGUF Quantization for Tight VRAM
  11. Video-to-Video Style Transfer
  12. Extending Beyond 5 Seconds
  13. Frame Interpolation (RIFE / FILM)
  14. Upscaling 720p → 1440p / 4K
  15. Performance Benchmarks
  16. Tuning Recipes
  17. Licensing
  18. Troubleshooting
  19. FAQ


What Wan 2.2 Is {#what-it-is}

Wan 2.2 (Tongyi Wanxiang 2.2) is Alibaba's video diffusion model, released July 28, 2025. Its architecture is a Diffusion Transformer (DiT) operating on temporally encoded latent video and, for the A14B models, the first Mixture-of-Experts (MoE) design in a video diffusion model. Two specialized experts split the denoising process: a high-noise expert sets the overall layout in the early steps, then a low-noise expert refines detail in the later steps, with the handoff chosen by a signal-to-noise threshold. Each A14B model therefore holds 27B total parameters but activates only ~14B per step, keeping per-step compute closer to that of a 14B dense model.
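The two-expert handoff can be pictured with a small sketch. This is an illustration, not Wan's actual scheduler: it stands in a linear noise level for the real signal-to-noise computation, and the `boundary` value of 0.5 is a made-up number chosen only to make the split easy to see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoESchedule:
    total_steps: int
    # Hypothetical handoff point on a 1.0 (pure noise) to 0.0 (clean) scale.
    boundary: float


def expert_for_step(step: int, schedule: MoESchedule) -> str:
    """Return which expert would run at a given 0-based denoising step."""
    if not 0 <= step < schedule.total_steps:
        raise ValueError("step out of range")
    # Simplification: treat noise as falling linearly as sampling proceeds.
    noise_level = 1.0 - step / schedule.total_steps
    return "high_noise" if noise_level >= schedule.boundary else "low_noise"


schedule = MoESchedule(total_steps=30, boundary=0.5)
plan = [expert_for_step(s, schedule) for s in range(schedule.total_steps)]
print(plan[0], plan[-1])  # first step uses the high-noise expert, last the low-noise one
```

Because exactly one expert runs at each step, only about half of the 27B weights take part in any single forward pass, which is where the "14B active" figure comes from.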

The dense TI2V-5B uses a high-compression Wan2.2 VAE (16×16×4 latent compression) so a single 5B model can do both text-to-video and image-to-video at 720p/24fps on a 24 GB GPU.
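To see what a 16×16×4 compression means in practice, the sketch below computes the latent grid for a 5-second clip. It assumes the frame count follows a `4n + 1` pattern (the first frame encoded on its own, the rest in groups of four), which is a common convention for video VAEs and should be checked against the model card; the function name is illustrative.

```python
def latent_shape(width: int, height: int, frames: int,
                 spatial: int = 16, temporal: int = 4) -> tuple:
    """Return (latent_frames, latent_height, latent_width) for a clip."""
    if width % spatial or height % spatial:
        raise ValueError("width and height must be multiples of the spatial factor")
    if (frames - 1) % temporal:
        # Assumption: valid frame counts are temporal * n + 1 (e.g. 121).
        raise ValueError("frames must be temporal * n + 1")
    return ((frames - 1) // temporal + 1, height // spatial, width // spatial)


# 5 seconds at 24 fps plus the initial frame = 121 frames at 1280x704.
shape = latent_shape(1280, 704, 121)
print(shape)  # (31, 44, 80)

pixels = 1280 * 704 * 121
latents = shape[0] * shape[1] * shape[2]
print(f"{pixels / latents:.0f}x fewer positions for the transformer to process")
```

The transformer works on the small latent grid rather than on raw pixels, which is a large part of why a 5B model can reach 720p on a 24 GB card.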

Capabilities:

  • Text-to-Video (T2V) — prompt → ~5-second clip (A14B or TI2V-5B)
  • Image-to-Video (I2V) — first frame + prompt → motion clip (A14B or TI2V-5B)
  • Speech/Audio-to-Video (S2V-14B) — drive a character clip from audio
  • Character animation (Animate-14B) — animate / reanimate a subject
  • Video-to-Video (V2V) — input clip + prompt → restyled clip (community workflows)

Project: github.com/Wan-Video/Wan2.2. Model weights on the Wan-AI HuggingFace org.


Variants: TI2V-5B, T2V-A14B, I2V-A14B, S2V, Animate {#variants}

| Variant | Params | Type | Best for |
| --- | --- | --- | --- |
| Wan 2.2 TI2V-5B | 5B dense | Unified T2V + I2V | The consumer pick, runs on 24 GB |
| Wan 2.2 T2V-A14B | 27B total / 14B active | MoE text-to-video | Highest-quality T2V |
| Wan 2.2 I2V-A14B | 27B total / 14B active | MoE image-to-video | Highest-quality I2V (most popular) |
| Wan 2.2 S2V-14B | 14B | Speech/audio-to-video | Talking / audio-driven character clips |
| Wan 2.2 Animate-14B | 14B | Character animation | Animate or reanimate a subject |

The two A14B models are MoE: 27B weights on disk, but only ~14B activate per denoising step. On a 24 GB card you run them with GGUF/FP8 quantization plus block offloading. The dense TI2V-5B is the one most local users start with — it fits 24 GB comfortably and covers both text-to-video and image-to-video. (There is no 1.3B Wan 2.2 model — the 1.3B variant existed in Wan 2.1.)
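The arithmetic behind that paragraph is simple enough to sketch. The figures below cover weights only (1 GB taken as 10^9 bytes); activations, the text encoder, and the VAE need additional memory, so treat the results as lower bounds rather than measured VRAM usage.

```python
# Bytes stored per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "fp8": 1.0}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight size in GB (10^9 bytes) for a parameter count in billions."""
    return params_billion * BYTES_PER_PARAM[dtype]


for label, params in [("TI2V-5B", 5), ("A14B, both experts", 27), ("A14B, one expert", 14)]:
    sizes = ", ".join(f"{d}: {weight_gb(params, d):.0f} GB" for d in ("fp16", "fp8"))
    print(f"{label:20s} {sizes}")
```

At FP16 the full 27B set is about 54 GB, far beyond a 24 GB card, while a single 14B expert at FP8 is about 14 GB. That gap is why the A14B models are usually run quantized with the inactive expert offloaded.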


Hardware Requirements {#requirements}

| GPU | Workflow |
| --- | --- |
| RTX 3060 / 4060 (8-12 GB) | TI2V-5B at Q5 GGUF (tight, reduced quality) |
| RTX 4070 Ti Super / 4080 (16 GB) | TI2V-5B (GGUF); A14B Q4 with heavy offload |
| RTX 4090 / 5090 / 7900 XTX (24-32 GB) | TI2V-5B comfortably; A14B with GGUF/FP8 + offload |
| RTX 5090 (32 GB) / Pro W7900 (48 GB) | A14B at higher precision |
| Mac Studio M-series (64-128 GB) | TI2V-5B via MPS (slow but works) |

Alibaba's reference: TI2V-5B needs at least 24 GB VRAM for single-GPU inference. System RAM 32 GB+ recommended. Disk 50-100 GB for the model collection.



Wan 2.2 vs HunyuanVideo vs Mochi {#comparison}

| Property | Wan 2.2 TI2V-5B | Wan 2.2 A14B (MoE) | HunyuanVideo 13B | Mochi 10B |
| --- | --- | --- | --- | --- |
| Quality | Strong for size | Excellent | Best | Good |
| Fit on a 24 GB GPU | Fits ✅ | GGUF/FP8 + offload | Q4: 24 GB tight | Native: 16 GB ✅ |
| Render time (5 sec / 720p) | ~9 min | Longer (27B weights) | 12-20 min | 5-8 min |
| Motion quality | Good | Excellent | Best | Cleanest |
| Long-clip coherence | Limited | Good | Best | Limited |
| ComfyUI support | Full (native + Kijai) | Full (Kijai handles MoE) | Full | Full |
| License | Apache 2.0 | Apache 2.0 | Tencent Hunyuan Community License (custom terms) | Apache 2.0 |

For most consumer GPU users in 2026: Wan 2.2 TI2V-5B is the easy default. For maximum Wan quality with more VRAM: the A14B models (run through Kijai's wrapper, which routes the high/low-noise experts efficiently). For cinematic output with 32+ GB / multi-GPU: HunyuanVideo.


Installation: ComfyUI + WanVideoWrapper {#installation}

Prerequisite: working ComfyUI install. See ComfyUI Complete Guide.

cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper
cd ComfyUI-WanVideoWrapper
pip install -r requirements.txt

Restart ComfyUI. The Wan-specific nodes appear under the WanVideo category.

For GGUF support: also install ComfyUI-GGUF (city96):

cd ComfyUI/custom_nodes
git clone https://github.com/city96/ComfyUI-GGUF

Downloading Models {#downloading}

The simplest route is the official ComfyUI repackaged repo (Comfy-Org), which ships the diffusion model, text encoder, and VAE already split for ComfyUI. Start with the TI2V-5B — it is the lightest and covers both text-to-video and image-to-video.

huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
    split_files/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors \
    --local-dir ComfyUI/models/diffusion_models

Required text encoder (UMT5-XXL)

huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
    split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors \
    --local-dir ComfyUI/models/text_encoders

VAE (Wan 2.2 high-compression VAE)

huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_repackaged \
    split_files/vae/wan2.2_vae.safetensors \
    --local-dir ComfyUI/models/vae

The TI2V-5B uses the new high-compression Wan2.2 VAE (16×16×4). The MoE A14B models use the earlier Wan VAE — check the model card for the exact VAE each variant expects rather than assuming one works everywhere. For tight VRAM, GGUF quantizations of these models are published by community quantizers (e.g. city96) and load through ComfyUI-GGUF; pick the exact filename from the quant repo you choose.
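One practical wrinkle: the repackaged repo stores its files under `split_files/<kind>/`, and downloads with `--local-dir` generally keep that subfolder structure under the target directory. If you prefer the files to sit directly in each ComfyUI model folder, a small Python sketch like the one below can do the mapping. It assumes the `huggingface_hub` package is installed and that a default ComfyUI folder layout is in use; `comfy_target` and `fetch_all` are illustrative helper names, not part of any library.

```python
import shutil
from pathlib import Path, PurePosixPath

REPO_ID = "Comfy-Org/Wan_2.2_ComfyUI_repackaged"
FILES = [
    "split_files/diffusion_models/wan2.2_ti2v_5B_fp16.safetensors",
    "split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors",
    "split_files/vae/wan2.2_vae.safetensors",
]


def comfy_target(repo_file: str, comfy_root: str = "ComfyUI") -> Path:
    """Map 'split_files/<kind>/<name>' to '<comfy_root>/models/<kind>/<name>'."""
    parts = PurePosixPath(repo_file).parts
    if len(parts) != 3 or parts[0] != "split_files":
        raise ValueError(f"unexpected repo path: {repo_file}")
    _, kind, name = parts
    return Path(comfy_root) / "models" / kind / name


def fetch_all(comfy_root: str = "ComfyUI") -> None:
    """Download each file to the Hugging Face cache, then copy it into place."""
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    for repo_file in FILES:
        cached = hf_hub_download(repo_id=REPO_ID, filename=repo_file)
        target = comfy_target(repo_file, comfy_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(cached, target)  # copies the data, so disk use doubles
        print(f"{repo_file} -> {target}")
```

Call `fetch_all()` from the directory that contains your ComfyUI folder. Copying out of the cache is the simplest approach but uses extra disk space; clear the cache afterwards if space is tight.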


Your First Text-to-Video {#first-t2v}

Use the official Wan 2.2 native workflow (ComfyUI → Templates → Video → Wan 2.2), or load the example workflow from ComfyUI-WanVideoWrapper/example_workflows/. A native TI2V-5B text-to-video graph looks like:

[Load Diffusion Model] → MODEL  (wan2.2_ti2v_5B_fp16.safetensors)
[Load CLIP] → CLIP (umt5_xxl_fp8_e4m3fn_scaled)
[Load VAE] → VAE (wan2.2_vae)
[CLIP Text Encode] → CONDITIONING positive
[CLIP Text Encode] → CONDITIONING negative
[EmptyLatentVideo / Wan empty latent] → LATENT (set 1280x704, ~24fps, ~5 sec)
[KSampler] (~30 steps)
[VAE Decode (Tiled)] → IMAGES
[VHS_VideoCombine] → MP4 file

Click Queue Prompt. On an RTX 4090, the TI2V-5B produces a 5-second 720p clip in under 9 minutes. The MoE A14B graphs add a second (high-noise / low-noise) model loader; Kijai's WanVideoWrapper wires that routing for you.
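If you would rather queue renders from a script than click in the UI, ComfyUI's built-in HTTP server accepts workflows on its `/prompt` endpoint. The sketch below assumes a local server on the default port 8188 and a workflow exported in API format from the ComfyUI menu; the file name `wan22_t2v_api.json` and the `client_id` value are placeholders.

```python
import json
import urllib.request


def build_prompt_request(workflow: dict,
                         server: str = "http://127.0.0.1:8188",
                         client_id: str = "wan22-guide") -> urllib.request.Request:
    """Build the POST request that queues an API-format workflow."""
    payload = json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path: str, server: str = "http://127.0.0.1:8188") -> dict:
    """Load an API-format workflow JSON from disk and send it to ComfyUI."""
    with open(path, encoding="utf-8") as fh:
        workflow = json.load(fh)
    with urllib.request.urlopen(build_prompt_request(workflow, server)) as resp:
        return json.load(resp)  # the reply includes an id for the queued job


# Example (needs a running ComfyUI instance):
# print(queue_workflow("wan22_t2v_api.json"))
```

This is handy for batch runs, for example queueing the same graph with several seeds or prompts overnight.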


Image-to-Video Workflow {#i2v}

TI2V-5B handles image-to-video in the same unified model — add an image input and use the I2V node from the native template or the wrapper:

[Load Image] → IMAGE (your start frame)
[Wan I2V / WanVideoImageToVideo] → LATENT (combines image + prompt + frame count)

For the dedicated I2V-A14B model, follow its native/wrapper example workflow (the MoE graph loads both the high-noise and low-noise experts). Best practice: start with a high-quality 720p image. Composition and color of the start frame strongly drive the rest of the clip.


Prompt Engineering {#prompts}

Wan responds to:

  • Camera moves: dolly, tracking, panning, aerial, drone, handheld
  • Lens: 24mm wide, 50mm portrait, 85mm telephoto, anamorphic
  • Lighting: golden hour, blue hour, neon, soft natural, harsh midday
  • Motion: slow-motion, fast pan, time-lapse, freeze frame
  • Atmosphere: misty, foggy, smoky, hazy, dust particles, lens flare
  • Style: cinematic, photorealistic, documentary, music video

Example prompt:

Cinematic tracking shot of a lone samurai walking through misty bamboo forest at dawn, soft volumetric god rays, 35mm anamorphic lens, slow-motion, fluttering leaves, atmospheric haze, deep depth of field, color graded teal and orange.

Negative prompt:

blurry, deformed, duplicate frames, jittery motion, watermark, text, low quality, oversaturated, washed out

For I2V, prompt drives motion not composition — the input image handles composition.
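When generating many shots, it can help to assemble prompts from the cue categories listed above so every shot states its camera, lens, and lighting explicitly. The helper below is a hypothetical convenience, not part of ComfyUI or Wan; the cue order is just one reasonable choice.

```python
CUE_ORDER = ("camera", "lens", "lighting", "motion", "atmosphere", "style")


def build_prompt(subject: str, **cues: str) -> str:
    """Join a subject with optional cinematography cues into one prompt string."""
    unknown = set(cues) - set(CUE_ORDER)
    if unknown:
        raise ValueError(f"unknown cue(s): {sorted(unknown)}")
    parts = [subject] + [cues[key] for key in CUE_ORDER if cues.get(key)]
    return ", ".join(parts) + "."


print(build_prompt(
    "a lone samurai walking through a misty bamboo forest at dawn",
    camera="cinematic tracking shot",
    lens="35mm anamorphic lens",
    motion="slow-motion",
    atmosphere="atmospheric haze",
))
```

For image-to-video, pass mostly motion and camera cues and keep the subject description short, since the start frame already fixes the composition.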


GGUF Quantization for Tight VRAM {#gguf}

GGUF lets you trade a little quality for a lot of VRAM headroom. The exact footprint depends on which model you quantize — the dense TI2V-5B is small enough that many users run it at FP16/FP8, while the A14B MoE models (27B weights on disk) usually need quantization plus block offloading to fit a 24 GB card.

General quant trade-off (relative, not exact VRAM):

| Quant | Quality | Speed | Notes |
| --- | --- | --- | --- |
| FP16 / BF16 | Reference | Slowest | Full precision |
| FP8 | Near reference | Faster | Common for the A14B on 24 GB |
| Q8_0 | ~99% of reference | Fast | Best GGUF quality |
| Q6_K | ~97% of reference | Faster | Good balance |
| Q5_K_M | ~94% of reference | Faster | Tight 12-16 GB cards |
| Q4_K_S | ~90%, visible drop | Fastest | Last resort on 8-12 GB |

For 24 GB GPUs: run TI2V-5B at FP16/FP8, or the A14B at FP8/Q8 with offload. For 8-12 GB cards: a Q5 GGUF of TI2V-5B. Pick the exact GGUF filename from the quant repo you download — sizes vary by quantizer.
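A rough way to shortlist a quant is to estimate weight size from bits per weight and pick the highest-quality option that fits your budget. The bits-per-weight values below are approximations (GGUF K-quants mix block types, so real files vary by quantizer), and the budget is a weights-only figure you choose after leaving headroom for activations, the text encoder, and the VAE.

```python
# Approximate bits per weight, ordered from highest to lowest quality.
APPROX_BITS_PER_WEIGHT = [
    ("FP16", 16.0),
    ("Q8_0", 8.5),
    ("Q6_K", 6.56),
    ("Q5_K_M", 5.7),
    ("Q4_K_S", 4.6),
]


def quant_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def pick_quant(params_billion: float, weight_budget_gb: float):
    """Return the first (highest-quality) quant whose weights fit, or None."""
    for name, bits in APPROX_BITS_PER_WEIGHT:
        if quant_weights_gb(params_billion, bits) <= weight_budget_gb:
            return name
    return None


print(pick_quant(5, 12))   # TI2V-5B with about 12 GB left for weights
print(pick_quant(14, 12))  # one A14B expert with the same budget
```

Treat the result as a starting point and confirm against the actual file sizes listed in the quant repo you download from.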


Video-to-Video Style Transfer {#v2v}

V2V workflow:

[Load Video] → IMAGES (input clip)
[VAE Encode (Tiled)] → LATENT
[KSampler] (denoise 0.4-0.6) → LATENT
[VAE Decode (Tiled)] → IMAGES

Lower denoise (0.3-0.5) preserves motion and composition; higher (0.6-0.8) deviates more.

For consistent style, pair with a style LoRA. Community fine-tunes for anime, oil painting, comic book, and various film looks are on HuggingFace.


Extending Beyond 5 Seconds {#extending}

Two reliable approaches:

  1. Last-frame-as-first-frame chaining: generate clip A, extract its last frame, feed it as the I2V input for clip B, repeat. Maintains visual continuity but loses long-range coherence after a couple of chains.
  2. Manual editing in DaVinci Resolve / Premiere: generate several separate ~5-second shots from your storyboard and edit them together with audio.

Approach 2 is recommended for any narrative content — treat Wan 2.2 as a per-shot tool. Longer single-pass clips are the headline target of the next (not-yet-open) generation, so do not plan around it for local work today.
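The chaining approach (approach 1) boils down to a short loop. The sketch below is independent of any particular backend: `generate_clip` is a hypothetical callable you would wire to your own I2V workflow, and the code assumes each generated clip begins with the start frame it was given, so that frame is dropped from every clip after the first to avoid a duplicate.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def chain_clips(generate_clip: Callable[[Frame, str], Sequence[Frame]],
                first_frame: Frame,
                prompts: Sequence[str]) -> List[Frame]:
    """Generate one clip per prompt, seeding each with the previous clip's last frame."""
    frames: List[Frame] = []
    start = first_frame
    for index, prompt in enumerate(prompts):
        clip = list(generate_clip(start, prompt))
        if not clip:
            raise ValueError(f"clip {index} came back empty")
        # Assumption: clip[0] repeats the start frame, so skip it after the first clip.
        frames.extend(clip if index == 0 else clip[1:])
        start = clip[-1]
    return frames


# Stand-in generator for illustration: frames are just integers.
def fake_generate(start: int, prompt: str) -> list:
    return [start, start + 1, start + 2]


print(chain_clips(fake_generate, 0, ["walk forward", "turn left"]))  # [0, 1, 2, 3, 4]
```

Each hop re-encodes a decoded frame, so small errors accumulate; that is the mechanism behind the loss of coherence after a couple of chains.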


Frame Interpolation (RIFE / FILM) {#interpolation}

24fps → 60fps for smoother playback:

# Install ComfyUI-Frame-Interpolation
cd ComfyUI/custom_nodes
git clone https://github.com/Fannovel16/ComfyUI-Frame-Interpolation

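In the workflow, add a RIFE VFI node after VAE Decode. The node's multiplier is a whole number, so 24→60 fps (a 2.5× ratio) is not a single setting: use 2 for 48 fps, or use 5 for 120 fps and keep every second frame to land on 60. Inference cost is minimal (~30 sec for a 5-sec clip on an RTX 4090).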
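Going from 24 to 60 fps is a 2.5× ratio. Assuming the interpolation node only accepts whole-number multipliers (check the node you install), the ratio can be split into an integer interpolation factor followed by keeping every k-th frame. The helper name below is illustrative.

```python
from fractions import Fraction


def interpolation_plan(src_fps: int, dst_fps: int) -> tuple:
    """Return (multiplier, keep_every) so that src_fps * multiplier / keep_every == dst_fps."""
    ratio = Fraction(dst_fps, src_fps)  # reduced automatically, e.g. 60/24 -> 5/2
    return ratio.numerator, ratio.denominator


multiplier, keep_every = interpolation_plan(24, 60)
print(f"interpolate x{multiplier}, then keep every {keep_every} frame(s)")  # x5, every 2
```

For 24→60 this means interpolating ×5 to 120 fps and keeping every second frame. If that is more work than the result is worth, a plain ×2 to 48 fps is the simpler option.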


Upscaling 720p → 1440p / 4K {#upscaling}

Use Real-ESRGAN x2 / x4 or Topaz Video AI (commercial) on the rendered MP4.

In ComfyUI:

[Upscale Image (using Model)] (Real-ESRGAN x2)

Apply per-frame after VAE decode. Combined render+interp+upscale on RTX 4090: ~10-15 min for a 5-sec 1440p 60fps clip.


Performance Benchmarks {#benchmarks}

5-second 720p clip, ~30 steps, RTX 4090:

| Variant | Time |
| --- | --- |
| Wan 2.2 TI2V-5B (Alibaba reference) | under 9 min |
| Wan 2.2 A14B (MoE, FP8/Q8 + offload) | longer, depends on offload setup |
| HunyuanVideo Q4 (comparison) | 12-20 min |
| Mochi (comparison) | 5-8 min |

The TI2V-5B "under 9 minutes" figure is Alibaba's published single-GPU reference; the A14B models are heavier and their time depends heavily on your quant + offloading configuration. The HunyuanVideo and Mochi rows are internal comparison runs. For 7900 XTX: expect noticeably slower than RTX 4090. For Apple Silicon via MPS: the 5B model runs but several times slower than a 4090.


Tuning Recipes {#tuning}

RTX 4090 / 5090 (best quality)

A14B at FP8/Q8 with block offload (run through Kijai's WanVideoWrapper for efficient MoE routing) + ~30 steps. Or TI2V-5B at FP16 for the fast path.

RTX 4070 Ti Super / 4080 (16 GB)

TI2V-5B (FP8 or Q-quant) + offload the UMT5 text encoder to CPU.

RTX 3060 / 4060 (8-12 GB)

Use a Q5 GGUF of TI2V-5B — slower and lower quality but viable. (There is no Wan 2.2 1.3B model; that was Wan 2.1.)

Apple Silicon

MPS-compatible PyTorch + TI2V-5B on an M-series chip with enough unified memory. Expect several times slower than a 4090.


Licensing {#licensing}

Wan 2.2 ships under the Apache 2.0 license — free for commercial use and redistribution, with no copyleft and no restriction on training other models. The model card adds an acceptable-use note against harmful or deceptive applications. Always confirm the current license on the Wan-AI HuggingFace model card before deploying commercially, since terms can differ between variants and future releases.

For other permissively-licensed video models: OpenSora and CogVideoX.


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
| --- | --- | --- |
| OOM at VAE decode | Tiled VAE not enabled | Use the VAE Decode (Tiled) node |
| Workflow won't load | Missing custom nodes | Install via Manager → Install Missing |
| Black output | NaN in the VAE at half precision | Launch ComfyUI with --fp32-vae |
| Jittery motion | Too few steps | Increase to 35-40 steps |
| Repeating frames | Wrong scheduler | Use sgm_uniform with dpmpp_2m |
| First/last frame mismatch | I2V CLIP Vision off | Connect CLIP Vision Encode |
| Slow on 7900 XTX | FlashAttention not built | Build the FA-2 ROCm fork |

FAQ {#faq}

See answers to common Wan 2.2 questions below.


Sources: Wan-Video GitHub | ComfyUI-WanVideoWrapper | city96 GGUF quants | Internal benchmarks RTX 4090, RX 7900 XTX, M4 Max.

Published: May 1, 2026 · Last updated: June 20, 2026 · Manually reviewed
