HunyuanVideo Local Setup Guide (2026): Tencent's Best Open Video Model
Want to go deeper than this article?
Free account unlocks the first chapter of all 17 courses — RAG, agents, MCP, voice AI, MLOps, real GitHub repos.
HunyuanVideo is the highest-quality open-source video model in 2026. Tencent's 13B-parameter DiT generates 5-15 second clips at 720p / 24fps with cinematic motion coherence that rivals closed-source Sora and Runway Gen-3. The price: 40 GB of VRAM unquantized. Community Q4 GGUF quants bring it within reach of 24 GB consumer cards, but the fit is tight. For RTX 5090 / Pro W7900 / multi-GPU users who want maximum quality, this is the right tool.
This guide covers everything: model variants (base T2V, I2V, FastVideo), GGUF quantization choices, ComfyUI installation with HunyuanVideoWrapper, prompt engineering for cinematic shots, LoRA training via Musubi Tuner, frame interpolation and upscaling for delivery, and benchmarks vs Wan 2.2 / Mochi / closed APIs.
Table of Contents
- What HunyuanVideo Is
- Variants: Base, I2V, FastVideo
- Hardware Requirements
- HunyuanVideo vs Wan 2.2 vs Mochi
- Installation: ComfyUI + HunyuanVideoWrapper
- Downloading Models and Encoders
- GGUF Quantization Options
- Your First Text-to-Video
- Image-to-Video Workflow
- Prompt Engineering for Cinematic Output
- LoRA Training via Musubi Tuner
- Frame-Pack and Long-Clip Generation
- FastVideo Distillation
- Multi-GPU for Lower Latency
- Frame Interpolation and Upscaling
- Performance Benchmarks
- Tuning Recipes
- Licensing
- Troubleshooting
What HunyuanVideo Is {#what-it-is}
HunyuanVideo (Tencent/HunyuanVideo on GitHub) is a 13B-parameter video diffusion transformer released by Tencent in December 2024. It uses:
- 3D Causal VAE — temporally aware encoding/decoding for motion coherence
- MMDiT-style DiT backbone — proven architecture from Stable Diffusion 3 / Flux
- T5 + LLaMA dual text encoders — richer prompt understanding than CLIP-based models
- Native 720p / 24fps generation up to 129 frames (~5.4 sec)
License: Tencent Hunyuan Community License (permissive with EU restrictions and competitor-training restriction).
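A quick sanity check for clip settings: the 3D causal VAE compresses time roughly 4x, which is why the frame counts used throughout this guide follow a 4k+1 pattern (73, 97, 129). A minimal sketch, assuming that 4x temporal compression:

```python
def latent_frames(frames: int) -> int:
    """Temporal latent length for a clip of `frames` frames, assuming the
    VAE's 4x causal temporal compression (hence the 4k+1 rule)."""
    if (frames - 1) % 4 != 0:
        raise ValueError("frame count must be 4k+1 (e.g. 73, 97, 129)")
    return (frames - 1) // 4 + 1

def duration_s(frames: int, fps: int = 24) -> float:
    """Clip duration in seconds at the given frame rate."""
    return frames / fps

print(latent_frames(129), round(duration_s(129), 1))  # 33 latent frames, ~5.4 s
```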
Variants: Base, I2V, FastVideo {#variants}
| Variant | Params | Use | Min GPU VRAM |
|---|---|---|---|
| HunyuanVideo (base T2V) | 13B | Text-to-video | 40 GB |
| HunyuanVideo-I2V | 13B | Image-to-video | 40 GB |
| Hunyuan-FastVideo (distilled) | 13B | Faster T2V | 40 GB |
| HunyuanVideo Q4 GGUF | 13B | Tight-VRAM cards | 24 GB |
| HunyuanVideo Q8 GGUF | 13B | Near-BF16 quality | 30 GB |
For consumer 24 GB GPUs: a Q4 or Q5 GGUF quant (tight, little headroom) or wait for an official FP8 release.
Hardware Requirements {#requirements}
| GPU | Variant |
|---|---|
| RTX 4090 / 7900 XTX (24 GB) | Q4/Q5 GGUF (tight) |
| RTX 5090 (32 GB) | Q8 GGUF comfortable |
| RTX A6000 / Pro W7900 (48 GB) | BF16 native |
| H100 (80 GB) | BF16 + long context |
| Mac Studio M4 Max (128 GB) | BF16 (slow but works) |
| Multi-GPU 2x 4090 | BF16 with model split |
System RAM 32 GB+, 64 GB recommended. Disk 50-100 GB for models + workflows.
HunyuanVideo vs Wan 2.2 vs Mochi {#comparison}
| Aspect | HunyuanVideo | Wan 2.2 14B | Mochi 10B |
|---|---|---|---|
| Quality | Best | Excellent | Good |
| Min VRAM | Q4: 24 GB | Q8: 22 GB | 16 GB |
| Render time (5sec/720p, RTX 4090) | 12-20 min | 6-8 min | 5-8 min |
| Long-clip coherence | Best | Good | Limited |
| Cinematic feel | Best | Excellent | Good |
| LoRA ecosystem | Growing | Growing | Smaller |
| License | Permissive (with EU caveat) | Permissive | Apache 2.0 |
For maximum quality with 32+ GB VRAM: HunyuanVideo. For 24 GB consumer GPUs with comfort: Wan 2.2 Q8. For lowest VRAM / Apache: Mochi.
Installation: ComfyUI + HunyuanVideoWrapper {#installation}
cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-HunyuanVideoWrapper
cd ComfyUI-HunyuanVideoWrapper
pip install -r requirements.txt
# For GGUF support
cd ../
git clone https://github.com/city96/ComfyUI-GGUF
Restart ComfyUI. Hunyuan nodes appear under HunyuanVideo category.
Downloading Models and Encoders {#downloading}
Base T2V (BF16)
huggingface-cli download tencent/HunyuanVideo \
--local-dir ComfyUI/models/diffusion_models/hunyuan
Q4 GGUF (24 GB GPU friendly)
huggingface-cli download city96/HunyuanVideo-gguf \
hunyuan-video-t2v-720p-Q4_K_S.gguf \
--local-dir ComfyUI/models/diffusion_models
Text encoders
huggingface-cli download Comfy-Org/HunyuanVideo_repackaged \
split_files/text_encoders/llava_llama3_fp8_scaled.safetensors \
split_files/text_encoders/clip_l.safetensors \
--local-dir ComfyUI/models/text_encoders
huggingface-cli download Comfy-Org/HunyuanVideo_repackaged \
split_files/vae/hunyuan_video_vae_bf16.safetensors \
--local-dir ComfyUI/models/vae
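Before launching ComfyUI, it's worth verifying that everything landed where the loaders expect it. A small check script; filenames mirror the commands above, but depending on your huggingface-cli version the repackaged files may land under a split_files/ subfolder, so adjust paths (or move the files) to match your tree:

```python
from pathlib import Path

ROOT = Path("ComfyUI/models")

# Filenames match the download commands above; adjust to your layout.
EXPECTED = [
    ROOT / "diffusion_models" / "hunyuan-video-t2v-720p-Q4_K_S.gguf",
    ROOT / "text_encoders" / "llava_llama3_fp8_scaled.safetensors",
    ROOT / "text_encoders" / "clip_l.safetensors",
    ROOT / "vae" / "hunyuan_video_vae_bf16.safetensors",
]

def missing(paths=EXPECTED):
    """Return the expected model files that are not on disk yet."""
    return [str(p) for p in paths if not p.exists()]

if __name__ == "__main__":
    gaps = missing()
    print("all models present" if not gaps else "missing:\n" + "\n".join(gaps))
```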
GGUF Quantization Options {#gguf}
| Quant | VRAM | Quality | Time (5sec/720p, RTX 4090) |
|---|---|---|---|
| BF16 | 40 GB | Reference | 12 min |
| FP8 | 25 GB | ~99% of BF16 | 13 min |
| Q8_0 | 22 GB | ~99% of BF16 | 14 min |
| Q5_K_M | 16 GB | ~95% of BF16 | 13 min |
| Q4_K_S | 12 GB | ~90% of BF16 | 12 min |
For RTX 4090 24 GB: Q5_K_M sweet spot (room for KV cache and longer clips). Q4 works but tight. Q8 is too tight on 24 GB once you add overheads.
Your First Text-to-Video {#first-t2v}
Load example_workflow.json from ComfyUI-HunyuanVideoWrapper/example_workflows/:
[UNet Loader (GGUF)] → MODEL (hunyuan-video-t2v-Q5_K_M.gguf)
[DualCLIPLoader] → CLIP (llava_llama3 + clip_l)
[Load VAE] → VAE (hunyuan_video_vae_bf16)
[CLIP Text Encode] → CONDITIONING positive
[CLIP Text Encode] → CONDITIONING negative
[Empty Hunyuan Latent] → LATENT (1280x720, 73 frames)
[KSampler] (30 steps, cfg 6.0, dpmpp_2m_sde, sgm_uniform)
[VAE Decode (Tiled)] → IMAGES
[VHS_VideoCombine] → MP4 (24fps)
Click Queue Prompt. RTX 4090 Q5: ~12-15 min for 3-second 720p output.
Image-to-Video Workflow {#i2v}
For HunyuanVideo-I2V:
[Load Image] → IMAGE (start frame, 720x1280 or 1280x720)
[Load Hunyuan I2V Model] → MODEL
[CLIP Vision Encode] → image conditioning
[HunyuanI2VLatent] → LATENT
[KSampler] (30 steps)
[VAE Decode]
The start image strongly drives composition; prompt drives motion and atmosphere. Use sharp, well-composed inputs at the same aspect ratio as your output.
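Input prep can be scripted. A hedged sketch using Pillow to resize-then-center-crop a start frame to 1280x720; the function name and sizing policy are my own, not part of the wrapper:

```python
from PIL import Image

def prepare_start_frame(path: str, size=(1280, 720)) -> Image.Image:
    """Resize, then center-crop a start image to the target resolution so the
    I2V input aspect ratio matches the output."""
    img = Image.open(path).convert("RGB")
    tw, th = size
    # Scale so the image covers the target box, then crop the excess.
    scale = max(tw / img.width, th / img.height)
    img = img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
    left = (img.width - tw) // 2
    top = (img.height - th) // 2
    return img.crop((left, top, left + tw, top + th))
```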
Prompt Engineering for Cinematic Output {#prompts}
HunyuanVideo responds best to detailed cinematographic prompts:
Professional cinematic shot of [subject] [action]. [Camera move] from [angle].
[Lens type], [aperture]. [Lighting]. [Atmosphere]. [Color grading].
[Time of day]. [Style references].
Example:
Professional cinematic dolly-in of a samurai sheathing his katana, slow-motion, 35mm anamorphic lens, f/2.8, golden-hour rim lighting, atmospheric haze, teal-and-orange color grade, dawn, in the style of Akira Kurosawa cinematography.
Negative prompt:
blurry, deformed, jittery, watermark, text overlay, low quality, harsh artifacts, frame duplication, oversaturated
For LoRA-conditioned generation, include the LoRA trigger word in the prompt.
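When batch-generating shots, it helps to fill the template programmatically so every prompt keeps the same cinematographic structure. A sketch; the default camera, lens, lighting, and grade values are illustrative, not magic values:

```python
def cinematic_prompt(subject, action,
                     camera="slow dolly-in",
                     lens="35mm anamorphic lens, f/2.8",
                     lighting="golden-hour rim lighting",
                     grade="teal-and-orange color grade",
                     extra=""):
    """Fill the cinematic prompt template above with the given fields."""
    parts = [f"Professional cinematic {camera} of {subject} {action}",
             lens, lighting, grade, extra]
    return ", ".join(p for p in parts if p) + "."

print(cinematic_prompt("a samurai", "sheathing his katana"))
```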
LoRA Training via Musubi Tuner {#lora-training}
git clone https://github.com/kohya-ss/musubi-tuner
cd musubi-tuner
pip install -r requirements.txt
Dataset format: 10-50 short video clips (3-5 seconds each) at 720p, paired with caption .txt files.
python hv_train_network.py \
--pretrained_model_name_or_path /path/to/hunyuan-video \
--dit_dtype bf16 \
--dataset_config /path/to/config.toml \
--network_module networks.lora \
--network_dim 32 --network_alpha 16 \
--learning_rate 1e-4 \
--output_dir ./lora_output \
--max_train_steps 4000
Time: ~6-12 hours on RTX 4090 for a 4000-step LoRA. Result: my_lora.safetensors loadable in ComfyUI workflows.
For character / style LoRAs: 20-30 reference clips covering varied angles, lighting, expressions.
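The caption pairing can be scripted: Musubi Tuner expects a .txt file with the same stem next to each clip. A minimal helper, assuming .mp4 clips; writing one shared caption is a simplification (per-clip captions describing each shot train better):

```python
from pathlib import Path

def write_captions(clip_dir: str, caption: str, trigger: str = "") -> int:
    """Write a .txt caption beside each video clip (same stem, .txt extension).
    `trigger` is an optional LoRA trigger word prepended to every caption.
    Returns the number of caption files written."""
    n = 0
    for clip in Path(clip_dir).glob("*.mp4"):
        text = f"{trigger} {caption}".strip()
        clip.with_suffix(".txt").write_text(text, encoding="utf-8")
        n += 1
    return n
```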
Frame-Pack and Long-Clip Generation {#frame-pack}
HunyuanVideo natively supports up to 129 frames (~5.4 sec at 24 fps). For longer:
- Frame-pack composition (built into wrapper): chain shorter generations with frame overlap for smoother joins.
- HunyuanVideo-LongContext (community fine-tune): up to ~10-second native generation at the cost of higher VRAM.
- Last-frame chaining: extract last frame, use as I2V input for next segment.
For narrative work, treat HunyuanVideo as a per-shot tool — generate 5-second shots covering your storyboard, edit together.
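Planning the segment boundaries for chaining can be sketched as a small helper; the 8-frame overlap below is an illustrative choice, not a wrapper default:

```python
def plan_segments(total_frames: int, seg_len: int = 129, overlap: int = 8):
    """Split a long clip into (start, end) frame ranges, with `overlap`
    frames shared between neighbours for frame-pack / last-frame chaining."""
    segments, start = [], 0
    while start + seg_len < total_frames:
        segments.append((start, start + seg_len))
        start += seg_len - overlap
    segments.append((start, total_frames))
    return segments

print(plan_segments(300))  # [(0, 129), (121, 250), (242, 300)]
```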
FastVideo Distillation {#fastvideo}
FastVideo (hao-ai-lab/FastVideo) is a distillation of HunyuanVideo that runs 4-8x faster with a slight quality loss:
huggingface-cli download FastVideo/FastHunyuan \
--local-dir ComfyUI/models/diffusion_models
Render time on RTX 4090: ~3-5 min for 5-second 720p (vs 12-20 min base). Quality is ~90-95% of base Hunyuan; for iteration / drafting it's the right choice. For final renders use base.
Multi-GPU for Lower Latency {#multi-gpu}
For 2x RTX 4090 (48 GB total):
# In ComfyUI-MultiGPU node configuration
diffusion_model_device: cuda:0
text_encoder_device: cuda:1 # offload T5 to second GPU
vae_device: cuda:0
Result: BF16 fits across two GPUs without quantization. Render time ~10 min on 2x 4090 vs 20 min on single 4090 Q4. For heavy production work, dual-GPU is the practical target.
Frame Interpolation and Upscaling {#post-process}
24fps → 60fps via RIFE:
[VAE Decode] → IMAGES → [RIFE VFI] → IMAGES (60fps)
720p → 1440p / 4K via Real-ESRGAN x2 or Topaz Video AI:
[Upscale Image (using Model)] (Real-ESRGAN x2 or x4-UltraSharp)
Combined render+interp+upscale on RTX 4090: ~25-35 min for a 5-second 1440p 60fps clip.
Performance Benchmarks {#benchmarks}
5-second 720p clip, 30 steps, RTX 4090 Q5_K_M:
| Stage | Time |
|---|---|
| Generation | 13 min |
| RIFE 24→60fps | 1 min |
| Real-ESRGAN x2 | 4 min |
| Total | ~18 min |
For 1280x720 → 3840x2160 (4K) via Real-ESRGAN x4: add ~10 min.
For batch generation (10 shots overnight): set up workflow loop with seed iteration; plan ~3-4 hours per 10 shots at 720p.
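The seed-iteration loop can be scripted against ComfyUI's HTTP API (POST to /prompt on port 8188, using a workflow exported in API format). A sketch under those assumptions; the node id below is a placeholder, so look up your KSampler's id in the exported JSON:

```python
import copy
import json
import urllib.request

def make_batch(workflow: dict, sampler_node_id: str, seeds):
    """Return one API payload per seed, each a deep copy of the workflow
    with the KSampler seed replaced. `sampler_node_id` varies per workflow."""
    jobs = []
    for seed in seeds:
        wf = copy.deepcopy(workflow)
        wf[sampler_node_id]["inputs"]["seed"] = seed
        jobs.append({"prompt": wf})
    return jobs

def submit(jobs, url="http://127.0.0.1:8188/prompt"):
    """Queue each job against a running ComfyUI instance."""
    for job in jobs:
        req = urllib.request.Request(url, data=json.dumps(job).encode(),
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)
```

Load your exported workflow with `json.load`, build the batch with ten seeds, and call `submit` before leaving it overnight.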
Tuning Recipes {#tuning}
RTX 4090 (best quality consumer)
Q5_K_M GGUF + 30 steps + dpmpp_2m_sde + sgm_uniform + cfg 6.0.
RTX 5090 32 GB
Q8 GGUF or FP8 + 30 steps + slightly tighter sampler for sharper output.
Pro W7900 / A6000 48 GB
BF16 + 40 steps + extended frame count (97 frames = 4 sec at 24fps).
Mac Studio M4 Max 128 GB
BF16 + 30 steps. Slow (4-5x slower than an RTX 4090) but fine for occasional renders.
Licensing {#licensing}
Tencent Hunyuan Community License. Permissive for most commercial use; explicit restrictions:
- Cannot use to train competitor video models
- Cannot deploy in EU member states without separate Tencent agreement
- Must preserve Tencent attribution in derivative works
For EU-based commercial deployment, switch to Mochi (Apache 2.0) or CogVideoX (commercial-friendly).
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| OOM at sampling | VRAM too tight | Lower frame count or use Q4 |
| OOM at VAE decode | Tiled VAE not enabled | Use Tiled VAE Decode |
| Black output | NaN in fp16 VAE | Use bf16 VAE explicitly |
| Workflow won't load | Wrapper version mismatch | Update wrapper + restart |
| Slow on 7900 XTX | FlashAttention build | Use ROCm FA-2 fork |
| Multi-GPU error | shm too small | Increase --shm-size to 16+ GB |
| Jittery motion | Too few steps | Increase to 35-40 |
Sources: HunyuanVideo GitHub | ComfyUI-HunyuanVideoWrapper | Musubi Tuner | city96 GGUF quants | Internal benchmarks RTX 4090, RTX 5090, Pro W7900.