HunyuanVideo 1.5 Local Setup Guide (2026): Tencent's Lightweight 8.3B Open Video Model
HunyuanVideo 1.5 changes the equation for local video. On November 20, 2025, Tencent released an 8.3B-parameter rebuild of HunyuanVideo that runs on consumer GPUs — the official repo lists a 14 GB VRAM minimum with model offloading, and ComfyUI's native support targets 24 GB cards. That is a dramatic drop from the original December-2024 13B model, which effectively needed ~40 GB or a multi-GPU rig. Despite the smaller size, Tencent reports stronger quality and motion coherence, driven by a new attention mechanism (SSTA), glyph-aware text encoding, and a built-in 1080p super-resolution stage.
This guide leads with HunyuanVideo 1.5 — what it is, the VRAM picture, native ComfyUI setup, the model files you need, and how it compares to the original 13B — then keeps the full original-13B reference (variants, GGUF quants, the kijai wrapper, LoRA training, multi-GPU) for anyone still running that pipeline.
Table of Contents
- HunyuanVideo 1.5 — What Changed
- HunyuanVideo 1.5: Specs and VRAM
- HunyuanVideo 1.5 in ComfyUI (Native)
- 1.5 vs the Original 13B
- The Original HunyuanVideo (13B): What It Is
- 13B Variants: Base, I2V, FastVideo
- 13B Hardware Requirements
- HunyuanVideo vs Wan 2.2 vs Mochi
- 13B Install: ComfyUI + HunyuanVideoWrapper
- Downloading 13B Models and Encoders
- 13B GGUF Quantization Options
- Your First Text-to-Video (13B)
- Image-to-Video Workflow
- Prompt Engineering for Cinematic Output
- LoRA Training
- Long-Clip Generation
- FastVideo Distillation (13B)
- Multi-GPU for Lower Latency
- Frame Interpolation and Upscaling
- Performance Benchmarks (13B)
- Tuning Recipes
- Licensing
- Troubleshooting
- FAQ
HunyuanVideo 1.5 — What Changed {#v15}
HunyuanVideo 1.5 (Tencent-Hunyuan/HunyuanVideo-1.5 on GitHub) shipped on November 20, 2025 with inference code and model weights, followed by official LoRA training scripts around December 5, 2025. It is a unified model that does both text-to-video and image-to-video from one architecture.
What is genuinely new versus the original 13B:
- 8.3B parameters (down from 13B) — yet Tencent reports better visual quality and motion.
- SSTA — Selective and Sliding Tile Attention — prunes redundant spatiotemporal key/value blocks, which Tencent credits for roughly doubling inference speed over the predecessor.
- Glyph-aware text encoding — a byT5 glyph encoder alongside a Qwen 2.5-VL text encoder, improving in-video text (Chinese and English) and prompt fidelity.
- Built-in cascaded super-resolution — a separate few-step distilled network upscales native output to 1080p, so HD delivery is part of the model, not a bolt-on upscaler.
- Consumer-GPU VRAM — the headline. See the next section.
If you only read one section, read this: the original 13B was a "big-GPU only" model; 1.5 is the version most local creators can actually run.
HunyuanVideo 1.5: Specs and VRAM {#v15-specs}
| Spec | HunyuanVideo 1.5 |
|---|---|
| Parameters | 8.3B |
| Released | November 20, 2025 (weights + code) |
| Modes | Text-to-video and image-to-video (unified) |
| Native resolution | 480p and 720p |
| Frame rate | 24 fps |
| Default length | 121 frames (~5 sec); longer durations supported |
| Super-resolution | Built-in distilled network → 1080p |
| Text encoders | Qwen 2.5-VL 7B + byT5 (glyph-aware) |
| Attention | SSTA (Selective and Sliding Tile Attention) |
| Min VRAM | 14 GB with model offloading enabled (official repo) |
| Recommended VRAM | 24 GB (ComfyUI native target, comfortable headroom) |
| License | Tencent Hunyuan Community License (excludes EU / UK / South Korea) |
On VRAM: Tencent's repo states 14 GB is the floor with offloading on — disabling offloading runs faster but needs more memory. ComfyUI's native implementation recommends 24 GB consumer cards because you are also loading the Qwen 2.5-VL encoder, the byT5 glyph encoder, the VAE, and the separate super-resolution model in addition to the diffusion transformer. Community offloading frameworks (e.g. Wan2GP) have reported pushing it onto cards as small as 6 GB with heavy optimization and a speed penalty — useful to know, but 14-24 GB is the practical range.
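A back-of-the-envelope check shows why offloading matters at the low end. The sketch below multiplies parameter count by bytes per parameter. It counts diffusion-transformer weights only and ignores activations, the text encoders, the VAE, and the super-resolution model, so treat the results as a rough lower bound rather than a measured requirement. The function name is illustrative, not part of any Tencent or ComfyUI API.

```python
# Rough weight-only memory estimate: parameters x bytes per parameter.
# Assumption: 1 GB = 10^9 bytes; activations and other models are NOT included.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate size of the model weights in GB for a given dtype."""
    return params_billion * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for name, params in [("HunyuanVideo 1.5", 8.3), ("HunyuanVideo 13B", 13.0)]:
        for dtype in ("fp16", "fp8"):
            print(f"{name} {dtype}: ~{weight_gb(params, dtype):.1f} GB of weights")
```

At FP16 the 8.3B transformer alone comes to roughly 16.6 GB, which is already above the 14 GB floor, so running at that memory level depends on offloading.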
HunyuanVideo 1.5 in ComfyUI (Native) {#v15-comfyui}
ComfyUI added native HunyuanVideo 1.5 support in late November 2025 — there is no third-party wrapper to install for 1.5. Update ComfyUI to a current build, then load the bundled Text-to-Video or Image-to-Video 720p template.
You need five files, placed in their matching folders:
# Diffusion model (text-to-video) → ComfyUI/models/diffusion_models
hunyuanvideo1.5_720p_t2v_fp16.safetensors
# Super-resolution / upscale model → ComfyUI/models/diffusion_models
hunyuanvideo1.5_1080p_sr_distilled_fp16.safetensors
# Text encoders → ComfyUI/models/text_encoders
qwen_2.5_vl_7b_fp8_scaled.safetensors
byt5_small_glyphxl_fp16.safetensors
# VAE → ComfyUI/models/vae
hunyuanvideo15_vae_fp16.safetensors
Tencent provides official FP8 and FP16 weights. The ComfyUI templates wire these into a two-stage graph: generate at 720p, then pass through the distilled super-resolution model to 1080p. Both T2V and I2V templates ship in the box; the I2V one adds a Load Image node feeding the start frame.
Note: at the time of writing, Tencent's official release shipped FP8/FP16 weights and ComfyUI native nodes. Community GGUF quants for 1.5 may exist by the time you read this — check city96 on HuggingFace — but they were not part of the official launch, so confirm the source before downloading.
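Misplaced files are a common reason a template refuses to load. As a minimal sketch, the script below checks that the five files listed above sit in their matching folders. The `ComfyUI` root path in the example call is an assumption, so point it at your own install; the helper name is hypothetical.

```python
from pathlib import Path

# Folder (relative to the ComfyUI root) -> files this guide lists for that folder.
EXPECTED = {
    "models/diffusion_models": [
        "hunyuanvideo1.5_720p_t2v_fp16.safetensors",
        "hunyuanvideo1.5_1080p_sr_distilled_fp16.safetensors",
    ],
    "models/text_encoders": [
        "qwen_2.5_vl_7b_fp8_scaled.safetensors",
        "byt5_small_glyphxl_fp16.safetensors",
    ],
    "models/vae": ["hunyuanvideo15_vae_fp16.safetensors"],
}


def missing_files(comfy_root: str) -> list[str]:
    """Return the relative paths of expected files that are not present."""
    root = Path(comfy_root)
    return [
        f"{folder}/{name}"
        for folder, names in EXPECTED.items()
        for name in names
        if not (root / folder / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_files("ComfyUI")  # assumed install location
    print("All five files found." if not missing else "Missing:\n" + "\n".join(missing))
```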
1.5 vs the Original 13B {#v15-vs-13b}
| Aspect | HunyuanVideo 1.5 (Nov 2025) | Original HunyuanVideo 13B (Dec 2024) |
|---|---|---|
| Parameters | 8.3B | 13B |
| Min VRAM | 14 GB (offloaded) / 24 GB comfortable | ~40 GB BF16; ~24 GB via community Q4 GGUF (tight) |
| Native resolution | 480p / 720p | 720p |
| Native upscaling | Built-in → 1080p | None (use Real-ESRGAN / Topaz separately) |
| ComfyUI support | Native (late Nov 2025) | Via kijai's HunyuanVideoWrapper |
| Text encoders | Qwen 2.5-VL + byT5 (glyph-aware) | LLaVA-LLaMA-3 (MLLM) + CLIP-L |
| Attention | SSTA (faster) | Standard DiT attention |
| LoRA training | Official Tencent scripts (Dec 2025) | Musubi Tuner / community |
| Best for | Most consumer GPUs (14-24 GB) | Existing 40 GB / multi-GPU rigs + mature LoRA library |
Bottom line: if you are starting fresh on a 14-24 GB consumer card, use 1.5. If you already run a working 40 GB or multi-GPU 13B pipeline with trained LoRAs and a tuned workflow, there is no urgent reason to migrate — but new work should target 1.5. The rest of this guide documents the original 13B pipeline for those users.
The Original HunyuanVideo (13B): What It Is {#what-it-is}
HunyuanVideo (Tencent/HunyuanVideo on GitHub) is the original 13B-parameter video diffusion transformer released by Tencent in December 2024. It uses:
- 3D Causal VAE — temporally aware encoding/decoding for motion coherence
- MMDiT-style DiT backbone — proven architecture from Stable Diffusion 3 / Flux
- LLaVA-LLaMA-3 (MLLM) + CLIP-L dual text encoders, for richer prompt understanding than CLIP-only models
- Native 720p / 24fps generation up to 129 frames (~5.4 sec)
License: Tencent Hunyuan Community License (permissive with EU/UK/South Korea exclusion and competitor-training restriction).
This is the model the rest of this guide covers in depth. It is still capable and widely deployed, but for new consumer-GPU setups, HunyuanVideo 1.5 (above) is the recommended starting point.
13B Variants: Base, I2V, FastVideo {#variants}
| Variant | Params | Use | VRAM (approx.) |
|---|---|---|---|
| HunyuanVideo (base T2V) | 13B | Text-to-video | 40 GB |
| HunyuanVideo-I2V | 13B | Image-to-video | 40 GB |
| Hunyuan-FastVideo (distilled) | 13B | Faster T2V | 40 GB |
| HunyuanVideo Q4 GGUF | 13B | Tight VRAM | 24 GB |
| HunyuanVideo Q8 GGUF | 13B | Quality + tighter | 30 GB |
On consumer 24 GB GPUs, use the Q4 GGUF (tight, no headroom). The FP8 weights listed in the quantization table below need about 25 GB, so they do not quite fit.
Hardware Requirements {#requirements}
| GPU | Variant |
|---|---|
| RTX 4090 / 7900 XTX (24 GB) | Q4 GGUF only — tight |
| RTX 5090 (32 GB) | Q8 GGUF comfortable |
| RTX A6000 / Pro W7900 (48 GB) | BF16 native |
| H100 (80 GB) | BF16 + long context |
| Mac Studio M4 Max 128GB | BF16 (slow but works) |
| Multi-GPU 2x 4090 | BF16 with model split |
System RAM: 32 GB minimum, 64 GB recommended. Disk: 50-100 GB for models and workflows.
HunyuanVideo vs Wan 2.2 vs Mochi {#comparison}
This table covers the original 13B HunyuanVideo. For HunyuanVideo 1.5's lower VRAM floor (14-24 GB), see the 1.5 vs 13B section above — 1.5 changes the VRAM column dramatically.
| Aspect | HunyuanVideo 13B | Wan 2.2 14B | Mochi 10B |
|---|---|---|---|
| Min VRAM | Q4: 24 GB (tight) | Q8: 22 GB | 16 GB |
| Render time (5sec/720p, RTX 4090) | 12-20 min | 6-8 min | 5-8 min |
| Long-clip coherence | Strong | Good | Limited |
| LoRA ecosystem | Mature | Growing | Smaller |
| License | Tencent Community (EU/UK/KR excluded) | Permissive | Apache 2.0 |
For the lowest VRAM floor among the Hunyuan family, use HunyuanVideo 1.5 (8.3B). For lowest VRAM with an Apache 2.0 license, use Mochi. We avoid declaring a single "best quality" winner here — relative quality shifts with each point release, and the honest differentiators are VRAM, license, and ecosystem.
Installation: ComfyUI + HunyuanVideoWrapper {#installation}
cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-HunyuanVideoWrapper
cd ComfyUI-HunyuanVideoWrapper
pip install -r requirements.txt
# For GGUF support
cd ../
git clone https://github.com/city96/ComfyUI-GGUF
Restart ComfyUI. The Hunyuan nodes appear under the HunyuanVideo category.
Downloading Models and Encoders {#downloading}
Base T2V (BF16)
huggingface-cli download tencent/HunyuanVideo \
--local-dir ComfyUI/models/diffusion_models/hunyuan
Q4 GGUF (24 GB GPU friendly)
huggingface-cli download city96/HunyuanVideo-gguf \
hunyuan-video-t2v-720p-Q4_K_S.gguf \
--local-dir ComfyUI/models/diffusion_models
Text encoders
huggingface-cli download Comfy-Org/HunyuanVideo_repackaged \
split_files/text_encoders/llava_llama3_fp8_scaled.safetensors \
split_files/text_encoders/clip_l.safetensors \
--local-dir ComfyUI/models/text_encoders
VAE
huggingface-cli download Comfy-Org/HunyuanVideo_repackaged \
split_files/vae/hunyuan_video_vae_bf16.safetensors \
--local-dir ComfyUI/models/vae
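One detail to watch: `huggingface-cli download` keeps the repository's folder layout under `--local-dir`, so the repackaged files land in a nested `split_files/...` subfolder rather than directly in `text_encoders` or `vae`. If you prefer them at the top level of each folder, a small helper like the sketch below can move them up. The function name is hypothetical, and it deliberately skips any file that would overwrite an existing one.

```python
import shutil
from pathlib import Path


def flatten_split_files(models_subdir: str) -> list[Path]:
    """Move *.safetensors from <models_subdir>/split_files/** up into <models_subdir>."""
    target = Path(models_subdir)
    nested_root = target / "split_files"
    if not nested_root.is_dir():
        return []
    moved = []
    # sorted() materialises the listing before any file is moved.
    for src in sorted(nested_root.rglob("*.safetensors")):
        dest = target / src.name
        if dest.exists():
            continue  # never overwrite a file that is already in place
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved


if __name__ == "__main__":
    # Assumed ComfyUI folder names, matching the commands above.
    for folder in ("ComfyUI/models/text_encoders", "ComfyUI/models/vae"):
        for path in flatten_split_files(folder):
            print("moved", path)
```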
GGUF Quantization Options {#gguf}
| Quant | VRAM | Quality | Time (5sec/720p, RTX 4090) |
|---|---|---|---|
| BF16 | 40 GB | Reference | 12 min |
| FP8 | 25 GB | ~99% of BF16 | 13 min |
| Q8_0 | 22 GB | ~99% of BF16 | 14 min |
| Q5_K_M | 16 GB | ~95% of BF16 | 13 min |
| Q4_K_S | 12 GB | ~90% of BF16 | 12 min |
For RTX 4090 24 GB: Q5_K_M is the sweet spot (room for activations and longer clips). Q4 works but is tight. Q8 is too tight on 24 GB once you add overheads.
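The selection logic above can be sketched as a small lookup: pick the highest-precision option whose VRAM figure still leaves headroom on your card. The VRAM numbers are copied from the table; the 6 GB default headroom is an assumption chosen for illustration, not a measured value, so adjust it to your workflow.

```python
# (name, approx. VRAM in GB) from the quantization table, highest precision first.
QUANTS = [("BF16", 40), ("FP8", 25), ("Q8_0", 22), ("Q5_K_M", 16), ("Q4_K_S", 12)]


def pick_quant(gpu_vram_gb: float, headroom_gb: float = 6.0):
    """Return the first quant that fits within VRAM minus headroom, or None."""
    budget = gpu_vram_gb - headroom_gb
    for name, needed_gb in QUANTS:
        if needed_gb <= budget:
            return name
    return None


if __name__ == "__main__":
    for vram in (16, 24, 32, 48):
        print(f"{vram} GB card -> {pick_quant(vram)}")
```

With these assumptions a 24 GB card resolves to Q5_K_M, which agrees with the recommendation above.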
Your First Text-to-Video {#first-t2v}
Load example_workflow.json from ComfyUI-HunyuanVideoWrapper/example_workflows/:
[UNet Loader (GGUF)] → MODEL (hunyuan-video-t2v-720p-Q5_K_M.gguf)
[DualCLIPLoader] → CLIP (llava_llama3 + clip_l)
[Load VAE] → VAE (hunyuan_video_vae_bf16)
[CLIP Text Encode] → CONDITIONING positive
[CLIP Text Encode] → CONDITIONING negative
[Empty Hunyuan Latent] → LATENT (1280x720, 73 frames)
[KSampler] (30 steps, cfg 6.0, dpmpp_2m_sde, sgm_uniform)
[VAE Decode (Tiled)] → IMAGES
[VHS_VideoCombine] → MP4 (24fps)
Click Queue Prompt. On an RTX 4090 with Q5_K_M, expect roughly 12-15 min for a 3-second (73-frame) 720p clip.
Image-to-Video Workflow {#i2v}
For HunyuanVideo-I2V:
[Load Image] → IMAGE (start frame, 720x1280 or 1280x720)
[Load Hunyuan I2V Model] → MODEL
[CLIP Vision Encode] → image conditioning
[HunyuanI2VLatent] → LATENT
[KSampler] (30 steps)
[VAE Decode]
The start image strongly drives composition; prompt drives motion and atmosphere. Use sharp, well-composed inputs at the same aspect ratio as your output.
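To match the input image to the output aspect ratio, you can center-crop it before loading. The sketch below only computes the crop box; the 1280x720 default mirrors the resolution used in this guide, and the function name is illustrative. The returned tuple uses the (left, top, right, bottom) convention, which is the same order Pillow's `Image.crop` accepts.

```python
def center_crop_box(width: int, height: int, target_w: int = 1280, target_h: int = 720):
    """Largest centered (left, top, right, bottom) box with the target aspect ratio."""
    # Compare aspect ratios by cross-multiplying to avoid floating-point error.
    if width * target_h > height * target_w:
        # Source is wider than the target: trim the sides.
        new_w = height * target_w // target_h
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Source is taller than (or equal to) the target: trim top and bottom.
    new_h = width * target_h // target_w
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)


if __name__ == "__main__":
    print(center_crop_box(1920, 1080))  # already 16:9, nothing is trimmed
    print(center_crop_box(1000, 1000))  # square source cropped to 16:9
```

For a portrait output, swap the targets (for example `target_w=720, target_h=1280`).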
Prompt Engineering for Cinematic Output {#prompts}
HunyuanVideo responds best to detailed cinematographic prompts:
Professional cinematic shot of [subject] [action]. [Camera move] from [angle].
[Lens type], [aperture]. [Lighting]. [Atmosphere]. [Color grading].
[Time of day]. [Style references].
Example:
Professional cinematic dolly-in of a samurai sheathing his katana, slow-motion, 35mm anamorphic lens, f/2.8, golden-hour rim lighting, atmospheric haze, teal-and-orange color grade, dawn, in the style of Akira Kurosawa cinematography.
Negative prompt:
blurry, deformed, jittery, watermark, text overlay, low quality, harsh artifacts, frame duplication, oversaturated
For LoRA-conditioned generation, include the LoRA trigger word in the prompt.
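If you generate many shots, filling the template by hand gets error-prone. As a minimal sketch, the helper below assembles the template from named fields and prepends an optional LoRA trigger word. The `Shot` class and `build_prompt` function are hypothetical names for illustration, not part of ComfyUI or the wrapper.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    subject: str
    action: str
    camera_move: str
    angle: str
    lens: str
    aperture: str
    lighting: str
    atmosphere: str
    color_grade: str
    time_of_day: str
    style: str
    trigger: str = ""  # optional LoRA trigger word


def build_prompt(s: Shot) -> str:
    """Fill the cinematic template above, prefixing the trigger word if set."""
    prompt = " ".join([
        f"Professional cinematic shot of {s.subject} {s.action}.",
        f"{s.camera_move} from {s.angle}.",
        f"{s.lens}, {s.aperture}. {s.lighting}. {s.atmosphere}. {s.color_grade}.",
        f"{s.time_of_day}. {s.style}.",
    ])
    return f"{s.trigger}, {prompt}" if s.trigger else prompt


if __name__ == "__main__":
    shot = Shot(
        subject="a samurai", action="sheathing his katana",
        camera_move="Dolly-in", angle="a low angle",  # angle is an example value
        lens="35mm anamorphic lens", aperture="f/2.8",
        lighting="Golden-hour rim lighting", atmosphere="Atmospheric haze",
        color_grade="Teal-and-orange color grade", time_of_day="Dawn",
        style="In the style of Akira Kurosawa cinematography",
    )
    print(build_prompt(shot))
```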
LoRA Training via Musubi Tuner {#lora-training}
git clone https://github.com/kohya-ss/musubi-tuner
cd musubi-tuner
pip install -r requirements.txt
Dataset format: 10-50 short video clips (3-5 seconds each) at 720p, paired with caption .txt files.
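# Musubi Tuner expects latents and text-encoder outputs to be cached first; see its README.
# Flag names can change between versions; verify them against your checkout before running.
# --dit takes the path to the DiT weights file rather than a model folder.
python hv_train_network.py \
--dit /path/to/hunyuan-video-dit-weights \
--dit_dtype bf16 \
--dataset_config /path/to/config.toml \
--network_module networks.lora \
--network_dim 32 --network_alpha 16 \
--learning_rate 1e-4 \
--output_dir ./lora_output \
--output_name my_lora \
--max_train_steps 4000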
Time: ~6-12 hours on RTX 4090 for a 4000-step LoRA. Result: my_lora.safetensors loadable in ComfyUI workflows.
For character / style LoRAs: 20-30 reference clips covering varied angles, lighting, expressions.
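Before a multi-hour training run, it is worth confirming that every clip actually has a caption. The sketch below assumes clips and captions share a folder and a base name (for example `clip01.mp4` and `clip01.txt`), as described above. The list of video extensions is an assumption, and the function name is hypothetical.

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mov", ".webm"}  # assumed clip formats; extend as needed


def clips_without_captions(dataset_dir: str) -> list[str]:
    """Return clip file names that have no non-empty .txt caption next to them."""
    missing = []
    for clip in sorted(Path(dataset_dir).iterdir()):
        if clip.suffix.lower() not in VIDEO_EXTS:
            continue
        caption = clip.with_suffix(".txt")
        if not caption.is_file() or not caption.read_text(encoding="utf-8").strip():
            missing.append(clip.name)
    return missing


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        problems = clips_without_captions(sys.argv[1])
        print("Dataset looks complete." if not problems else "No caption for: " + ", ".join(problems))
```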
Frame-Pack and Long-Clip Generation {#frame-pack}
HunyuanVideo natively supports up to 129 frames (~5.4 sec at 24 fps). For longer:
- Frame-pack composition (built into wrapper): chain shorter generations with frame overlap for smoother joins.
- HunyuanVideo-LongContext (community fine-tune): up to ~10-second native generation at the cost of higher VRAM.
- Last-frame chaining: extract last frame, use as I2V input for next segment.
For narrative work, treat HunyuanVideo as a per-shot tool — generate 5-second shots covering your storyboard, edit together.
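For planning purposes, the arithmetic behind chained generation can be sketched as follows. It assumes 24 fps, the 129-frame per-generation limit quoted above, and that each chained segment repeats a fixed number of overlap frames from the previous one (one frame for last-frame chaining). The function names are illustrative.

```python
import math

FPS = 24
MAX_FRAMES = 129  # native per-generation limit quoted above


def frames_to_seconds(frames: int) -> float:
    """Clip duration in seconds at 24 fps."""
    return frames / FPS


def segments_needed(target_seconds: float, overlap_frames: int = 1) -> int:
    """Number of chained generations needed to cover the target duration."""
    if not 0 <= overlap_frames < MAX_FRAMES:
        raise ValueError("overlap_frames must be between 0 and MAX_FRAMES - 1")
    total = math.ceil(target_seconds * FPS)
    if total <= MAX_FRAMES:
        return 1
    # Every segment after the first adds only its non-overlapping frames.
    new_per_segment = MAX_FRAMES - overlap_frames
    return 1 + math.ceil((total - MAX_FRAMES) / new_per_segment)


if __name__ == "__main__":
    print(frames_to_seconds(129))  # 5.375
    for seconds in (5, 15, 30):
        print(f"{seconds} s -> {segments_needed(seconds)} segment(s)")
```

Larger overlaps give smoother joins but increase the number of generations, which is one more reason to treat each shot as its own short clip.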
FastVideo Distillation {#fastvideo}
FastHunyuan, from the FastVideo project (hao-ai-lab/FastVideo), is a distilled variant of HunyuanVideo that runs 4-8x faster with slight quality loss:
huggingface-cli download FastVideo/FastHunyuan \
--local-dir ComfyUI/models/diffusion_models
Render time on RTX 4090: ~3-5 min for 5-second 720p (vs 12-20 min base). Quality is ~90-95% of base Hunyuan; for iteration / drafting it's the right choice. For final renders use base.
Multi-GPU for Lower Latency {#multi-gpu}
For 2x RTX 4090 (48 GB total):
# In ComfyUI-MultiGPU node configuration
diffusion_model_device: cuda:0
text_encoder_device: cuda:1 # offload the text encoders to the second GPU
vae_device: cuda:0
Result: BF16 fits across two GPUs without quantization. Render time ~10 min on 2x 4090 vs 20 min on single 4090 Q4. For heavy production work, dual-GPU is the practical target.
Frame Interpolation and Upscaling {#post-process}
24fps → 60fps via RIFE:
[VAE Decode] → IMAGES → [RIFE VFI] → IMAGES (60fps)
720p → 1440p / 4K via Real-ESRGAN x2 or Topaz Video AI:
[Upscale Image (using Model)] (Real-ESRGAN x2 or x4-UltraSharp)
Combined render+interp+upscale on RTX 4090: ~25-35 min for a 5-second 1440p 60fps clip.
Performance Benchmarks {#benchmarks}
5-second 720p clip, 30 steps, RTX 4090 Q5_K_M:
| Stage | Time |
|---|---|
| Generation | 13 min |
| RIFE 24→60fps | 1 min |
| Real-ESRGAN x2 | 4 min |
| Total | ~18 min |
For 1280x720 → 3840x2160 (4K) via Real-ESRGAN x4: add ~10 min.
For batch generation (10 shots overnight): set up workflow loop with seed iteration; plan ~3-4 hours per 10 shots at 720p.
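One way to set up the seed loop is to queue jobs through ComfyUI's HTTP endpoint. The sketch below assumes the workflow was exported in API format, that the server listens on the default `http://127.0.0.1:8188`, and that the sampler node is the built-in `KSampler` with a `seed` input. The helper names and the workflow file name are hypothetical.

```python
import copy
import json
import urllib.request


def with_seed(workflow: dict, seed: int) -> dict:
    """Return a copy of an API-format workflow with every KSampler seed replaced."""
    wf = copy.deepcopy(workflow)
    for node in wf.values():
        if node.get("class_type") == "KSampler":
            node["inputs"]["seed"] = seed
    return wf


def queue_prompt(workflow: dict, server: str = "http://127.0.0.1:8188") -> None:
    """POST one workflow to the ComfyUI /prompt endpoint."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"{server}/prompt", data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request).close()


def queue_batch(workflow_path: str, base_seed: int, shots: int = 10) -> None:
    """Queue `shots` copies of the workflow with consecutive seeds."""
    with open(workflow_path, encoding="utf-8") as f:
        workflow = json.load(f)
    for i in range(shots):
        queue_prompt(with_seed(workflow, base_seed + i))


# Example (requires a running ComfyUI server; file name is an assumption):
# queue_batch("hunyuan_t2v_api.json", base_seed=1000, shots=10)
```

ComfyUI processes queued prompts one after another, so queuing ten jobs before you leave is enough for an overnight run.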
Tuning Recipes {#tuning}
RTX 4090 (best quality consumer)
Q5_K_M GGUF + 30 steps + dpmpp_2m_sde + sgm_uniform + cfg 6.0.
RTX 5090 32 GB
Q8 GGUF or FP8 + 30 steps + slightly tighter sampler for sharper output.
Pro W7900 / A6000 48 GB
BF16 + 40 steps + extended frame count (97 frames = 4 sec at 24fps).
Mac Studio M4 Max 128 GB
BF16 + 30 steps. Slow (4-5x RTX 4090) but works for occasional renders.
Licensing {#licensing}
Both the original 13B HunyuanVideo and HunyuanVideo 1.5 ship under the Tencent Hunyuan Community License Agreement — NOT Apache 2.0 (several third-party write-ups incorrectly claimed 1.5 was Apache 2.0; the official LICENSE file is the Tencent Community license). Permissive for most commercial use, with explicit restrictions:
- Territory: The license explicitly does NOT apply in the European Union, the United Kingdom, or South Korea.
- Scale cap: Above 100 million monthly active users, you must request a separate license from Tencent (granted at its discretion).
- Competitor training: Cannot use outputs to train competing AI models.
- Disclosure / attribution: Must disclose you are the service provider and preserve Tencent attribution; the license is governed by Hong Kong law.
For deployment in the EU, UK, or South Korea, or for an unambiguous Apache 2.0 license, switch to Mochi (Apache 2.0) or CogVideoX. Always read the LICENSE in the official repo before commercial use.
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| OOM at sampling | VRAM too tight | Lower frame count or use Q4 |
| OOM at VAE decode | Tiled VAE not enabled | Use Tiled VAE Decode |
| Black output | NaN in fp16 VAE | Use bf16 VAE explicitly |
| Workflow won't load | Wrapper version mismatch | Update wrapper + restart |
| Slow on 7900 XTX | FlashAttention build | Use ROCm FA-2 fork |
| Multi-GPU error | shm too small | Increase --shm-size to 16+ GB |
| Jittery motion | Too few steps | Increase to 35-40 |
FAQ {#faq}
See answers to common HunyuanVideo questions below.
Sources: HunyuanVideo 1.5 GitHub (Tencent) | HunyuanVideo 1.5 — ComfyUI native support | HunyuanVideo 1.5 model card (HuggingFace) | Original HunyuanVideo 13B GitHub | ComfyUI-HunyuanVideoWrapper | Musubi Tuner | city96 GGUF quants | Internal benchmarks RTX 4090, RTX 5090, Pro W7900.