
HunyuanVideo Local Setup Guide (2026): Tencent's Best Open Video Model

May 1, 2026
22 min read
LocalAimaster Research Team


HunyuanVideo is the highest-quality open-source video model in 2026. Tencent's 13B-parameter DiT generates 5-15 second clips at 720p / 24fps with cinematic motion coherence that rivals closed-source Sora and Runway Gen-3. The price: 40 GB VRAM unquantized. Community Q4 GGUF brings it down to 24 GB consumer cards, though the fit is tight. For RTX 5090 / Pro W7900 / multi-GPU users who want maximum quality, this is the right tool.

This guide covers everything: model variants (base T2V, I2V, FastVideo), GGUF quantization choices, ComfyUI installation with HunyuanVideoWrapper, prompt engineering for cinematic shots, LoRA training via Musubi Tuner, frame interpolation and upscaling for delivery, and benchmarks vs Wan 2.2 / Mochi / closed APIs.

Table of Contents

  1. What HunyuanVideo Is
  2. Variants: Base, I2V, FastVideo
  3. Hardware Requirements
  4. HunyuanVideo vs Wan 2.2 vs Mochi
  5. Installation: ComfyUI + HunyuanVideoWrapper
  6. Downloading Models and Encoders
  7. GGUF Quantization Options
  8. Your First Text-to-Video
  9. Image-to-Video Workflow
  10. Prompt Engineering for Cinematic Output
  11. LoRA Training via Musubi Tuner
  12. Frame-Pack and Long-Clip Generation
  13. FastVideo Distillation
  14. Multi-GPU for Lower Latency
  15. Frame Interpolation and Upscaling
  16. Performance Benchmarks
  17. Tuning Recipes
  18. Licensing
  19. Troubleshooting
  20. FAQ

What HunyuanVideo Is {#what-it-is}

HunyuanVideo (Tencent/HunyuanVideo on GitHub) is a 13B-parameter video diffusion transformer released by Tencent in December 2024. It uses:

  • 3D Causal VAE — temporally aware encoding/decoding for motion coherence
  • MMDiT-style DiT backbone — proven architecture from Stable Diffusion 3 / Flux
  • LLaVA-LLaMA-3 + CLIP-L dual text encoders — richer prompt understanding than CLIP-only models
  • Native 720p / 24fps generation up to 129 frames (~5.4 sec)
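
The 3D Causal VAE compresses roughly 4x in time and 8x in space (the published ratios; treat them as an assumption if you are on a modified checkpoint), which is what makes 129 frames the native ceiling. A quick sanity check of the latent the DiT actually denoises:

# Latent-shape arithmetic for a 1280x720, 129-frame clip.
# Assumes the published 3D Causal VAE ratios: 4x temporal, 8x spatial.
frames, height, width = 129, 720, 1280
t_latent = (frames - 1) // 4 + 1          # causal VAE keeps frame 0, compresses the rest
h_latent, w_latent = height // 8, width // 8
print(t_latent, h_latent, w_latent)       # 33 x 90 x 160 latent "video"
print(f"{frames / 24:.2f} s at 24 fps")   # 5.38 s, the "~5.4 sec" above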

License: Tencent Hunyuan Community License (permissive, with EU-deployment and competitor-training restrictions).


Variants: Base, I2V, FastVideo {#variants}

| Variant | Params | Use | VRAM |
|---|---|---|---|
| HunyuanVideo (base T2V) | 13B | Text-to-video | 40 GB |
| HunyuanVideo-I2V | 13B | Image-to-video | 40 GB |
| Hunyuan-FastVideo (distilled) | 13B | Faster T2V | 40 GB |
| HunyuanVideo Q4 GGUF | 13B | Tight VRAM budgets | 24 GB |
| HunyuanVideo Q8 GGUF | 13B | Near-BF16 quality, less VRAM | 30 GB |

For consumer 24 GB GPUs: Q4 or Q5 GGUF (tight, little headroom) or wait for an official FP8 release.


Hardware Requirements {#requirements}

| GPU | Variant |
|---|---|
| RTX 4090 / 7900 XTX (24 GB) | Q4/Q5 GGUF — tight |
| RTX 5090 (32 GB) | Q8 GGUF, comfortable |
| RTX A6000 / Pro W7900 (48 GB) | BF16 native |
| H100 (80 GB) | BF16 + long context |
| Mac Studio M4 Max (128 GB) | BF16 (slow but works) |
| Multi-GPU 2x 4090 | BF16 with model split |

System RAM 32 GB+, 64 GB recommended. Disk 50-100 GB for models + workflows.
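
Before committing to a 50-100 GB download, confirm what the table implies for your card. A quick probe with plain PyTorch (nothing Hunyuan-specific, just a working CUDA or ROCm build):

import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"cuda:{i} {props.name}: {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA/ROCm device visible - check your driver install")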


HunyuanVideo vs Wan 2.2 vs Mochi {#comparison}

| Aspect | HunyuanVideo | Wan 2.2 14B | Mochi 10B |
|---|---|---|---|
| Quality | Best | Excellent | Good |
| Min VRAM | Q4: 24 GB | Q8: 22 GB | 16 GB |
| Render time (5 s / 720p, RTX 4090) | 12-20 min | 6-8 min | 5-8 min |
| Long-clip coherence | Best | Good | Limited |
| Cinematic feel | Best | Excellent | Good |
| LoRA ecosystem | Growing | Growing | Smaller |
| License | Permissive (with EU caveat) | Permissive | Apache 2.0 |

For maximum quality with 32+ GB VRAM: HunyuanVideo. For 24 GB consumer GPUs with comfort: Wan 2.2 Q8. For lowest VRAM / Apache: Mochi.


Installation: ComfyUI + HunyuanVideoWrapper {#installation}

cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-HunyuanVideoWrapper
cd ComfyUI-HunyuanVideoWrapper
pip install -r requirements.txt

# For GGUF support
cd ../
git clone https://github.com/city96/ComfyUI-GGUF

Restart ComfyUI. Hunyuan nodes appear under HunyuanVideo category.


Downloading Models and Encoders {#downloading}

Base T2V (BF16)

huggingface-cli download tencent/HunyuanVideo \
    --local-dir ComfyUI/models/diffusion_models/hunyuan

Q4 GGUF (24 GB GPU friendly)

huggingface-cli download city96/HunyuanVideo-gguf \
    hunyuan-video-t2v-720p-Q4_K_S.gguf \
    --local-dir ComfyUI/models/diffusion_models

Text encoders

huggingface-cli download Comfy-Org/HunyuanVideo_repackaged \
    split_files/text_encoders/llava_llama3_fp8_scaled.safetensors \
    split_files/text_encoders/clip_l.safetensors \
    --local-dir ComfyUI/models/text_encoders

huggingface-cli download Comfy-Org/HunyuanVideo_repackaged \
    split_files/vae/hunyuan_video_vae_bf16.safetensors \
    --local-dir ComfyUI/models/vae
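
Note that huggingface-cli preserves repo subpaths under --local-dir, so the repackaged files land inside a split_files/ folder; move them into the directories ComfyUI actually scans. Assuming the default paths used above, the target layout looks roughly like:

ComfyUI/models/
├── diffusion_models/
│   ├── hunyuan/                                # BF16 repo files
│   └── hunyuan-video-t2v-720p-Q4_K_S.gguf      # or your chosen quant
├── text_encoders/
│   ├── llava_llama3_fp8_scaled.safetensors
│   └── clip_l.safetensors
└── vae/
    └── hunyuan_video_vae_bf16.safetensors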

GGUF Quantization Options {#gguf}

| Quant | VRAM | Quality | Time (5 s / 720p, RTX 4090) |
|---|---|---|---|
| BF16 | 40 GB | Reference | 12 min |
| FP8 | 25 GB | ~99% of BF16 | 13 min |
| Q8_0 | 22 GB | ~99% of BF16 | 14 min |
| Q5_K_M | 16 GB | ~95% of BF16 | 13 min |
| Q4_K_S | 12 GB | ~90% of BF16 | 12 min |

For an RTX 4090 (24 GB): Q5_K_M is the sweet spot (headroom for activations and longer clips). Q4 works but is tight. Q8 is too tight on 24 GB once you add overheads.
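
The VRAM column is dominated by weight storage, which you can sanity-check with back-of-envelope arithmetic (approximate bits-per-weight times parameter count); activations, the VAE, and the text encoders add the rest:

# Approximate weight footprints for a 13B model at common quant widths.
# The bits-per-weight values are rough; real GGUF files differ slightly
# (mixed-precision blocks, quant scales, embeddings).
params = 13e9
for name, bits in [("BF16", 16), ("FP8", 8), ("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_S", 4.6)]:
    gb = params * bits / 8 / 1e9
    print(f"{name:7s} ~{gb:4.1f} GB weights (+ activations, VAE, encoders)")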


Your First Text-to-Video {#first-t2v}

Load example_workflow.json from ComfyUI-HunyuanVideoWrapper/example_workflows/:

[UNet Loader (GGUF)] → MODEL (hunyuan-video-t2v-Q5_K_M.gguf)
[DualCLIPLoader] → CLIP (llava_llama3 + clip_l)
[Load VAE] → VAE (hunyuan_video_vae_bf16)
[CLIP Text Encode] → CONDITIONING positive
[CLIP Text Encode] → CONDITIONING negative
[Empty Hunyuan Latent] → LATENT (1280x720, 73 frames)
[KSampler] (30 steps, cfg 6.0, dpmpp_2m_sde, sgm_uniform)
[VAE Decode (Tiled)] → IMAGES
[VHS_VideoCombine] → MP4 (24fps)

Click Queue Prompt. RTX 4090 Q5: ~12-15 min for 3-second 720p output.
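
ComfyUI is the most flexible route, but if you prefer a plain script, recent diffusers releases ship HunyuanVideo support. A minimal sketch: the hunyuanvideo-community model ID and the exact memory-saving calls are assumptions to verify against the current diffusers docs.

import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

model_id = "hunyuanvideo-community/HunyuanVideo"  # assumed diffusers-format mirror
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()          # same idea as the tiled VAE decode node
pipe.enable_model_cpu_offload()   # trades speed for VRAM on 24 GB cards

video = pipe(
    prompt="Professional cinematic shot of a samurai sheathing his katana",
    height=720, width=1280, num_frames=73,  # ~3 s at 24 fps, matching the workflow above
    num_inference_steps=30,
).frames[0]
export_to_video(video, "first_t2v.mp4", fps=24)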


Image-to-Video Workflow {#i2v}

For HunyuanVideo-I2V:

[Load Image] → IMAGE (start frame, 720x1280 or 1280x720)
[Load Hunyuan I2V Model] → MODEL
[CLIP Vision Encode] → image conditioning
[HunyuanI2VLatent] → LATENT
[KSampler] (30 steps)
[VAE Decode]

The start image strongly drives composition; prompt drives motion and atmosphere. Use sharp, well-composed inputs at the same aspect ratio as your output.
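
Because the start image drives composition, resize and crop it to exactly the output resolution before loading it. A small OpenCV helper (plain image prep, nothing Hunyuan-specific):

import cv2

def prep_start_frame(path, out_w=1280, out_h=720):
    """Center-crop to the target aspect ratio, then resize to out_w x out_h."""
    img = cv2.imread(path)
    h, w = img.shape[:2]
    target = out_w / out_h
    if w / h > target:                      # too wide: trim the sides
        new_w = int(h * target)
        x0 = (w - new_w) // 2
        img = img[:, x0:x0 + new_w]
    else:                                   # too tall: trim top and bottom
        new_h = int(w / target)
        y0 = (h - new_h) // 2
        img = img[y0:y0 + new_h, :]
    return cv2.resize(img, (out_w, out_h), interpolation=cv2.INTER_AREA)

cv2.imwrite("start_frame_720p.png", prep_start_frame("input.jpg"))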


Prompt Engineering for Cinematic Output {#prompts}

HunyuanVideo responds best to detailed cinematographic prompts:

Professional cinematic shot of [subject] [action]. [Camera move] from [angle].
[Lens type], [aperture]. [Lighting]. [Atmosphere]. [Color grading].
[Time of day]. [Style references].

Example:

Professional cinematic dolly-in of a samurai sheathing his katana, slow-motion, 35mm anamorphic lens, f/2.8, golden-hour rim lighting, atmospheric haze, teal-and-orange color grade, dawn, in the style of Akira Kurosawa cinematography.

Negative prompt:

blurry, deformed, jittery, watermark, text overlay, low quality, harsh artifacts, frame duplication, oversaturated

For LoRA-conditioned generation, include the LoRA trigger word in the prompt.
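
When generating many shots, filling the template programmatically keeps every field present and consistent. A trivial helper mirroring the template above (the defaults are just the example values):

def cinematic_prompt(subject, action, camera="dolly-in", angle="eye level",
                     lens="35mm anamorphic lens", aperture="f/2.8",
                     lighting="golden-hour rim lighting", atmosphere="atmospheric haze",
                     grade="teal-and-orange color grade", time_of_day="dawn",
                     style="in the style of Akira Kurosawa cinematography"):
    return (f"Professional cinematic {camera} of {subject} {action}, from {angle}. "
            f"{lens}, {aperture}. {lighting}. {atmosphere}. {grade}. "
            f"{time_of_day}. {style}.")

print(cinematic_prompt("a samurai", "sheathing his katana"))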


LoRA Training via Musubi Tuner {#lora-training}

git clone https://github.com/kohya-ss/musubi-tuner
cd musubi-tuner
pip install -r requirements.txt

Dataset format: 10-50 short video clips (3-5 seconds each) at 720p, paired with caption .txt files.
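
The config.toml passed to --dataset_config in the command below tells Musubi Tuner where clips, captions, and latent caches live. A hypothetical sketch: key names vary between versions, so verify every field against the repo's dataset documentation.

# Hypothetical dataset config - check key names against the Musubi Tuner docs.
[general]
resolution = [1280, 720]
caption_extension = ".txt"
batch_size = 1

[[datasets]]
video_directory = "/data/my_character/clips"   # 10-50 clips, 3-5 s each
cache_directory = "/data/my_character/cache"   # latents are cached here before training
target_frames = [25, 45, 73]                   # frame counts sampled from each clip
frame_extraction = "head"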

python hv_train_network.py \
    --pretrained_model_name_or_path /path/to/hunyuan-video \
    --dit_dtype bf16 \
    --dataset_config /path/to/config.toml \
    --network_module networks.lora \
    --network_dim 32 --network_alpha 16 \
    --learning_rate 1e-4 \
    --output_dir ./lora_output \
    --max_train_steps 4000

Time: ~6-12 hours on RTX 4090 for a 4000-step LoRA. Result: my_lora.safetensors loadable in ComfyUI workflows.

For character / style LoRAs: 20-30 reference clips covering varied angles, lighting, expressions.


Frame-Pack and Long-Clip Generation {#frame-pack}

HunyuanVideo natively supports up to 129 frames (~5.4 sec at 24 fps). For longer:

  • Frame-pack composition (built into wrapper): chain shorter generations with frame overlap for smoother joins.
  • HunyuanVideo-LongContext (community fine-tune): up to ~10-second native generation at the cost of higher VRAM.
  • Last-frame chaining: extract last frame, use as I2V input for next segment.
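
For last-frame chaining, grabbing the final frame is a few lines of OpenCV (a minimal sketch; seeking can be off by a frame or two with some codecs, so eyeball the result):

import cv2

def last_frame(video_path, out_path="segment_last.png"):
    """Save the final frame of a clip to seed the next I2V segment."""
    cap = cv2.VideoCapture(video_path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, n - 1)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read the last frame of {video_path}")
    cv2.imwrite(out_path, frame)
    return out_path

last_frame("shot_01.mp4")  # feed the PNG into the I2V workflow above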

For narrative work, treat HunyuanVideo as a per-shot tool — generate 5-second shots covering your storyboard, edit together.


FastVideo Distillation {#fastvideo}

FastVideo (hao-ai-lab/FastVideo) is a variant distilled from HunyuanVideo that runs 4-8x faster with a slight quality loss:

huggingface-cli download FastVideo/FastHunyuan \
    --local-dir ComfyUI/models/diffusion_models

Render time on RTX 4090: ~3-5 min for 5-second 720p (vs 12-20 min base). Quality is ~90-95% of base Hunyuan; for iteration / drafting it's the right choice. For final renders use base.


Multi-GPU for Lower Latency {#multi-gpu}

For 2x RTX 4090 (48 GB total):

# In ComfyUI-MultiGPU node configuration
diffusion_model_device: cuda:0
text_encoder_device: cuda:1   # offload the LLaMA text encoder to the second GPU
vae_device: cuda:0

Result: BF16 fits across two GPUs without quantization. Render time ~10 min on 2x 4090 vs 20 min on single 4090 Q4. For heavy production work, dual-GPU is the practical target.


Frame Interpolation and Upscaling {#post-process}

24fps → 60fps via RIFE:

[VAE Decode] → IMAGES → [RIFE VFI] → IMAGES (60fps)

720p → 1440p / 4K via Real-ESRGAN x2 or Topaz Video AI:

[Upscale Image (using Model)] (Real-ESRGAN x2 or x4-UltraSharp)

Combined render+interp+upscale on RTX 4090: ~25-35 min for a 5-second 1440p 60fps clip.


Performance Benchmarks {#benchmarks}

5-second 720p clip, 30 steps, RTX 4090 Q5_K_M:

| Stage | Time |
|---|---|
| Generation | 13 min |
| RIFE 24→60 fps | 1 min |
| Real-ESRGAN x2 | 4 min |
| Total | ~18 min |

For 1280x720 → 3840x2160 (4K) via Real-ESRGAN x4: add ~10 min.

For batch generation (10 shots overnight): set up workflow loop with seed iteration; plan ~3-4 hours per 10 shots at 720p.
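
ComfyUI exposes an HTTP endpoint (POST /prompt) that turns the overnight loop into a short script. The sketch below assumes you exported the workflow with "Save (API Format)" and that node id "3" is your KSampler; check the ids in your own export.

import copy
import json

import requests

COMFY = "http://127.0.0.1:8188"
with open("hunyuan_t2v_api.json") as f:       # exported via Save (API Format)
    workflow = json.load(f)

for seed in range(1000, 1010):                # 10 shots, one seed each
    wf = copy.deepcopy(workflow)
    wf["3"]["inputs"]["seed"] = seed          # "3" = KSampler id in *this* export
    r = requests.post(f"{COMFY}/prompt", json={"prompt": wf})
    r.raise_for_status()
    print(f"queued seed {seed}: {r.json().get('prompt_id')}")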


Tuning Recipes {#tuning}

RTX 4090 (best quality consumer)

Q5_K_M GGUF + 30 steps + dpmpp_2m_sde + sgm_uniform + cfg 6.0.

RTX 5090 32 GB

Q8 GGUF or FP8 + 30 steps + slightly tighter sampler for sharper output.

Pro W7900 / A6000 48 GB

BF16 + 40 steps + extended frame count (97 frames = 4 sec at 24fps).

Mac Studio M4 Max 128 GB

BF16 + 30 steps. Slow (4-5x the RTX 4090's render time) but works for occasional renders.


Licensing {#licensing}

Tencent Hunyuan Community License. Permissive for most commercial use; explicit restrictions:

  • Cannot use to train competitor video models
  • Cannot deploy in EU member states without separate Tencent agreement
  • Must preserve Tencent attribution in derivative works

For EU-based commercial deployment, switch to Mochi (Apache 2.0) or CogVideoX (commercial-friendly).


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
|---|---|---|
| OOM at sampling | VRAM too tight | Lower frame count or use Q4 |
| OOM at VAE decode | Tiled VAE not enabled | Use Tiled VAE Decode |
| Black output | NaN in fp16 VAE | Use bf16 VAE explicitly |
| Workflow won't load | Wrapper version mismatch | Update wrapper + restart |
| Slow on 7900 XTX | FlashAttention build | Use ROCm FA-2 fork |
| Multi-GPU error | shm too small | Increase --shm-size to 16+ GB |
| Jittery motion | Too few steps | Increase to 35-40 |

FAQ {#faq}

See answers to common HunyuanVideo questions below.


Sources: HunyuanVideo GitHub | ComfyUI-HunyuanVideoWrapper | Musubi Tuner | city96 GGUF quants | Internal benchmarks RTX 4090, RTX 5090, Pro W7900.
