HunyuanVideo 1.5 Local Setup Guide (2026): Tencent's Lightweight 8.3B Open Video Model

May 1, 2026
24 min read
LocalAimaster Research Team

HunyuanVideo 1.5 changes the equation for local video. On November 20, 2025, Tencent released an 8.3B-parameter rebuild of HunyuanVideo that runs on consumer GPUs — the official repo lists a 14 GB VRAM minimum with model offloading, and ComfyUI's native support targets 24 GB cards. That is a dramatic drop from the original December-2024 13B model, which effectively needed ~40 GB or a multi-GPU rig. Despite the smaller size, Tencent reports stronger quality and motion coherence, driven by a new attention mechanism (SSTA), glyph-aware text encoding, and a built-in 1080p super-resolution stage.

This guide leads with HunyuanVideo 1.5 — what it is, the VRAM picture, native ComfyUI setup, the model files you need, and how it compares to the original 13B — then keeps the full original-13B reference (variants, GGUF quants, the kijai wrapper, LoRA training, multi-GPU) for anyone still running that pipeline.

Table of Contents

  1. HunyuanVideo 1.5 — What Changed
  2. HunyuanVideo 1.5: Specs and VRAM
  3. HunyuanVideo 1.5 in ComfyUI (Native)
  4. 1.5 vs the Original 13B
  5. The Original HunyuanVideo (13B): What It Is
  6. 13B Variants: Base, I2V, FastVideo
  7. 13B Hardware Requirements
  8. HunyuanVideo vs Wan 2.2 vs Mochi
  9. 13B Install: ComfyUI + HunyuanVideoWrapper
  10. Downloading 13B Models and Encoders
  11. 13B GGUF Quantization Options
  12. Your First Text-to-Video (13B)
  13. Image-to-Video Workflow
  14. Prompt Engineering for Cinematic Output
  15. LoRA Training
  16. Long-Clip Generation
  17. FastVideo Distillation (13B)
  18. Multi-GPU for Lower Latency
  19. Frame Interpolation and Upscaling
  20. Performance Benchmarks (13B)
  21. Tuning Recipes
  22. Licensing
  23. Troubleshooting
  24. FAQ

HunyuanVideo 1.5 — What Changed {#v15}

HunyuanVideo 1.5 (Tencent-Hunyuan/HunyuanVideo-1.5 on GitHub) shipped on November 20, 2025 with inference code and model weights, followed by official LoRA training scripts around December 5, 2025. It is a unified model that does both text-to-video and image-to-video from one architecture.

What is genuinely new versus the original 13B:

  • 8.3B parameters (down from 13B) — yet Tencent reports better visual quality and motion.
  • SSTA — Selective and Sliding Tile Attention — prunes redundant spatiotemporal key/value blocks, which Tencent credits for roughly doubling inference speed over the predecessor.
  • Glyph-aware text encoding — a byT5 glyph encoder alongside a Qwen 2.5-VL text encoder, improving in-video text (Chinese and English) and prompt fidelity.
  • Built-in cascaded super-resolution — a separate few-step distilled network upscales native output to 1080p, so HD delivery is part of the model, not a bolt-on upscaler.
  • Consumer-GPU VRAM — the headline. See the next section.

If you only read one section, read this: the original 13B was a "big-GPU only" model; 1.5 is the version most local creators can actually run.


HunyuanVideo 1.5: Specs and VRAM {#v15-specs}

| Spec | HunyuanVideo 1.5 |
| --- | --- |
| Parameters | 8.3B |
| Released | November 20, 2025 (weights + code) |
| Modes | Text-to-video and image-to-video (unified) |
| Native resolution | 480p and 720p |
| Frame rate | 24 fps |
| Default length | 121 frames (~5 sec); longer durations supported |
| Super-resolution | Built-in distilled network → 1080p |
| Text encoders | Qwen 2.5-VL 7B + byT5 (glyph-aware) |
| Attention | SSTA (Selective and Sliding Tile Attention) |
| Min VRAM | 14 GB with model offloading enabled (official repo) |
| Recommended VRAM | 24 GB (ComfyUI native target, comfortable headroom) |
| License | Tencent Hunyuan Community License (excludes EU / UK / South Korea) |

On VRAM: Tencent's repo states 14 GB is the floor with offloading on — disabling offloading runs faster but needs more memory. ComfyUI's native implementation recommends 24 GB consumer cards because you are also loading the Qwen 2.5-VL encoder, the byT5 glyph encoder, the VAE, and the separate super-resolution model in addition to the diffusion transformer. Community offloading frameworks (e.g. Wan2GP) have reported pushing it onto cards as small as 6 GB with heavy optimization and a speed penalty — useful to know, but 14-24 GB is the practical range.
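
As a rough illustration of the tiers described above, the sketch below maps a card's VRAM to the setup this section suggests. The function name and the wording of each tier are ours, and the thresholds are simply the 6 / 14 / 24 GB figures quoted here, not values read from Tencent's or ComfyUI's code.

```python
def hunyuan15_vram_tier(vram_gb: float) -> str:
    """Map a card's VRAM (in GB) to the tiers quoted in this section."""
    if vram_gb >= 24:
        return "comfortable: ComfyUI native target"
    if vram_gb >= 14:
        return "supported: enable model offloading"
    if vram_gb >= 6:
        # Community reports only (e.g. Wan2GP); expect a speed penalty.
        return "experimental: heavy community offloading"
    return "below the reported range"


for card_gb in (8, 16, 24):
    print(f"{card_gb} GB -> {hunyuan15_vram_tier(card_gb)}")
```

Treat the result as a starting point: actual headroom also depends on resolution, frame count, and whether the super-resolution stage is loaded.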


HunyuanVideo 1.5 in ComfyUI (Native) {#v15-comfyui}

ComfyUI added native HunyuanVideo 1.5 support in late November 2025 — there is no third-party wrapper to install for 1.5. Update ComfyUI to a current build, then load the bundled Text-to-Video or Image-to-Video 720p template.

You need five files, placed in their matching folders:

# Diffusion model (text-to-video) → ComfyUI/models/diffusion_models
hunyuanvideo1.5_720p_t2v_fp16.safetensors

# Super-resolution / upscale model → ComfyUI/models/diffusion_models
hunyuanvideo1.5_1080p_sr_distilled_fp16.safetensors

# Text encoders → ComfyUI/models/text_encoders
qwen_2.5_vl_7b_fp8_scaled.safetensors
byt5_small_glyphxl_fp16.safetensors

# VAE → ComfyUI/models/vae
hunyuanvideo15_vae_fp16.safetensors

Tencent provides official FP8 and FP16 weights. The ComfyUI templates wire these into a two-stage graph: generate at 720p, then pass through the distilled super-resolution model to 1080p. Both T2V and I2V templates ship in the box; the I2V one adds a Load Image node feeding the start frame.
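
A missing or misplaced file is a common reason a template refuses to run, so a quick check can save a restart. The sketch below only tests that the five file names listed above exist in their folders. `ComfyUI/models` is an assumed install path and `missing_model_files` is a hypothetical helper, not part of ComfyUI.

```python
from pathlib import Path

# File name -> subfolder under ComfyUI/models, as listed in this section.
REQUIRED_FILES = {
    "hunyuanvideo1.5_720p_t2v_fp16.safetensors": "diffusion_models",
    "hunyuanvideo1.5_1080p_sr_distilled_fp16.safetensors": "diffusion_models",
    "qwen_2.5_vl_7b_fp8_scaled.safetensors": "text_encoders",
    "byt5_small_glyphxl_fp16.safetensors": "text_encoders",
    "hunyuanvideo15_vae_fp16.safetensors": "vae",
}


def missing_model_files(models_dir: Path) -> list[str]:
    """Return relative paths of required files that are not present."""
    missing = []
    for name, folder in REQUIRED_FILES.items():
        if not (models_dir / folder / name).is_file():
            missing.append(f"{folder}/{name}")
    return missing


if __name__ == "__main__":
    # Assumed install location; adjust to your own ComfyUI path.
    for rel_path in missing_model_files(Path("ComfyUI/models")):
        print("missing:", rel_path)
```

If you use FP8 weights or the I2V model instead, edit the dictionary to match the file names you actually downloaded.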

Note: at the time of writing, Tencent's official release shipped FP8/FP16 weights and ComfyUI native nodes. Community GGUF quants for 1.5 may exist by the time you read this — check city96 on HuggingFace — but they were not part of the official launch, so confirm the source before downloading.


1.5 vs the Original 13B {#v15-vs-13b}

| Aspect | HunyuanVideo 1.5 (Nov 2025) | Original HunyuanVideo 13B (Dec 2024) |
| --- | --- | --- |
| Parameters | 8.3B | 13B |
| Min VRAM | 14 GB (offloaded) / 24 GB comfortable | ~40 GB BF16; ~24 GB via community Q4 GGUF (tight) |
| Native resolution | 480p / 720p | 720p |
| Native upscaling | Built-in → 1080p | None (use Real-ESRGAN / Topaz separately) |
| ComfyUI support | Native (late Nov 2025) | Via kijai's HunyuanVideoWrapper |
| Text encoders | Qwen 2.5-VL + byT5 (glyph-aware) | LLaVA-LLaMA 3 (MLLM) + CLIP-L |
| Attention | SSTA (faster) | Standard DiT attention |
| LoRA training | Official Tencent scripts (Dec 2025) | Musubi Tuner / community |
| Best for | Most consumer GPUs (14-24 GB) | Existing 40 GB / multi-GPU rigs + mature LoRA library |

Bottom line: if you are starting fresh on a 14-24 GB consumer card, use 1.5. If you already run a working 40 GB or multi-GPU 13B pipeline with trained LoRAs and a tuned workflow, there is no urgent reason to migrate — but new work should target 1.5. The rest of this guide documents the original 13B pipeline for those users.


The Original HunyuanVideo (13B): What It Is {#what-it-is}

HunyuanVideo (Tencent/HunyuanVideo on GitHub) is the original 13B-parameter video diffusion transformer released by Tencent in December 2024. It uses:

  • 3D Causal VAE — temporally aware encoding/decoding for motion coherence
  • MMDiT-style DiT backbone — proven architecture from Stable Diffusion 3 / Flux
  • LLaVA-LLaMA 3 (MLLM) + CLIP-L dual text encoders (richer prompt understanding than CLIP-only models)
  • Native 720p / 24fps generation up to 129 frames (~5.4 sec)

License: Tencent Hunyuan Community License (permissive with EU/UK/South Korea exclusion and competitor-training restriction).

This is the model the rest of this guide covers in depth. It is still capable and widely deployed, but for new consumer-GPU setups, HunyuanVideo 1.5 (above) is the recommended starting point.


13B Variants: Base, I2V, FastVideo {#variants}

| Variant | Params | Use | VRAM |
| --- | --- | --- | --- |
| HunyuanVideo (base T2V) | 13B | Text-to-video | 40 GB (BF16) |
| HunyuanVideo-I2V | 13B | Image-to-video | 40 GB (BF16) |
| Hunyuan-FastVideo (distilled) | 13B | Faster T2V | 40 GB (BF16) |
| HunyuanVideo Q4 GGUF | 13B | Tight VRAM | 24 GB |
| HunyuanVideo Q8 GGUF | 13B | Quality + tighter | 30 GB |

For consumer 24 GB GPUs: Q4 GGUF (tight, no headroom) or wait for FP8 official release.


Hardware Requirements {#requirements}

| GPU | Variant |
| --- | --- |
| RTX 4090 / 7900 XTX (24 GB) | Q4 GGUF only (tight) |
| RTX 5090 (32 GB) | Q8 GGUF comfortable |
| RTX A6000 / Pro W7900 (48 GB) | BF16 native |
| H100 (80 GB) | BF16 + long context |
| Mac Studio M4 Max (128 GB) | BF16 (slow but works) |
| Multi-GPU 2x 4090 | BF16 with model split |

System RAM 32 GB+, 64 GB recommended. Disk 50-100 GB for models + workflows.


HunyuanVideo vs Wan 2.2 vs Mochi {#comparison}

This table covers the original 13B HunyuanVideo. For HunyuanVideo 1.5's lower VRAM floor (14-24 GB), see the 1.5 vs 13B section above — 1.5 changes the VRAM column dramatically.

| Aspect | HunyuanVideo 13B | Wan 2.2 14B | Mochi 10B |
| --- | --- | --- | --- |
| Min VRAM | Q4: 24 GB (tight) | Q8: 22 GB | 16 GB |
| Render time (5 sec / 720p, RTX 4090) | 12-20 min | 6-8 min | 5-8 min |
| Long-clip coherence | Strong | Good | Limited |
| LoRA ecosystem | Mature | Growing | Smaller |
| License | Tencent Community (EU/UK/KR excluded) | Permissive | Apache 2.0 |

For the lowest VRAM floor among the Hunyuan family, use HunyuanVideo 1.5 (8.3B). For lowest VRAM with an Apache 2.0 license, use Mochi. We avoid declaring a single "best quality" winner here — relative quality shifts with each point release, and the honest differentiators are VRAM, license, and ecosystem.


Installation: ComfyUI + HunyuanVideoWrapper {#installation}

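# Install kijai's wrapper and its Python dependencies
cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-HunyuanVideoWrapper
cd ComfyUI-HunyuanVideoWrapper
pip install -r requirements.txt

# For GGUF support (back in ComfyUI/custom_nodes)
cd ../
git clone https://github.com/city96/ComfyUI-GGUF
pip install --upgrade gguf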

Restart ComfyUI. The Hunyuan nodes appear under the HunyuanVideo category.


Downloading Models and Encoders {#downloading}

Base T2V (BF16)

huggingface-cli download tencent/HunyuanVideo \
    --local-dir ComfyUI/models/diffusion_models/hunyuan

Q4 GGUF (24 GB GPU friendly)

huggingface-cli download city96/HunyuanVideo-gguf \
    hunyuan-video-t2v-720p-Q4_K_S.gguf \
    --local-dir ComfyUI/models/diffusion_models

Text encoders and VAE

huggingface-cli download Comfy-Org/HunyuanVideo_repackaged \
    split_files/text_encoders/llava_llama3_fp8_scaled.safetensors \
    split_files/text_encoders/clip_l.safetensors \
    --local-dir ComfyUI/models/text_encoders

huggingface-cli download Comfy-Org/HunyuanVideo_repackaged \
    split_files/vae/hunyuan_video_vae_bf16.safetensors \
    --local-dir ComfyUI/models/vae
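
One thing to watch with the repackaged downloads above: `huggingface-cli download` keeps the repo-relative path under `--local-dir`, so the files land in a nested `split_files/...` folder rather than directly in `text_encoders` or `vae`. ComfyUI should still find models in subfolders, so the following is optional tidying. It is a minimal sketch with a hypothetical helper name, `flatten_downloads`, that moves `.safetensors` files up into the target folder and never overwrites an existing file.

```python
import shutil
from pathlib import Path


def flatten_downloads(target_dir: Path, nested_root: str = "split_files") -> list[Path]:
    """Move *.safetensors from target_dir/<nested_root>/** up into target_dir."""
    moved: list[Path] = []
    nested = target_dir / nested_root
    if not nested.is_dir():
        return moved
    # sorted() materializes the listing before any file is moved.
    for src in sorted(nested.rglob("*.safetensors")):
        dest = target_dir / src.name
        if dest.exists():
            continue  # never overwrite an existing file
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved


if __name__ == "__main__":
    # Assumed install paths; adjust to your own ComfyUI location.
    for folder in ("ComfyUI/models/text_encoders", "ComfyUI/models/vae"):
        for path in flatten_downloads(Path(folder)):
            print("moved:", path)
```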

GGUF Quantization Options {#gguf}

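| Quant | VRAM | Quality | Time (5 sec / 720p, RTX 4090) |
| --- | --- | --- | --- |
| BF16 | 40 GB | Reference | 12 min |
| FP8 | 25 GB | ~99% of BF16 | 13 min |
| Q8_0 | 22 GB | ~99% of BF16 | 14 min |
| Q5_K_M | 16 GB | ~95% of BF16 | 13 min |
| Q4_K_S | 12 GB | ~90% of BF16 | 12 min |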

For an RTX 4090 (24 GB), Q5_K_M is the sweet spot: it leaves room for activations and longer clips. Q4_K_S also fits, at lower quality. Q8_0 is too tight on 24 GB once you add overheads.
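
The selection logic above can be sketched as "take the highest-quality quant that still leaves some headroom". The VRAM and quality figures below are copied from the table in this section. The 4 GB default headroom and the `pick_quant` helper are our own assumptions for illustration, not measured values.

```python
from typing import Optional

# (name, VRAM in GB, quality relative to BF16), ordered best quality first.
QUANTS = [
    ("BF16", 40, 1.00),
    ("FP8", 25, 0.99),
    ("Q8_0", 22, 0.99),
    ("Q5_K_M", 16, 0.95),
    ("Q4_K_S", 12, 0.90),
]


def pick_quant(card_vram_gb: float, headroom_gb: float = 4.0) -> Optional[str]:
    """Return the best-quality quant that fits with headroom, else None."""
    budget = card_vram_gb - headroom_gb
    for name, vram_gb, _quality in QUANTS:
        if vram_gb <= budget:
            return name
    return None


for card in (24, 32, 48):
    print(f"{card} GB card -> {pick_quant(card)}")
```

With these assumptions a 24 GB card resolves to Q5_K_M, which matches the recommendation above. Raise `headroom_gb` if you plan longer clips.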


Your First Text-to-Video {#first-t2v}

Load example_workflow.json from ComfyUI-HunyuanVideoWrapper/example_workflows/:

[UNet Loader (GGUF)] → MODEL (hunyuan-video-t2v-720p-Q5_K_M.gguf)
[DualCLIPLoader] → CLIP (llava_llama3 + clip_l)
[Load VAE] → VAE (hunyuan_video_vae_bf16)
[CLIP Text Encode] → CONDITIONING positive
[CLIP Text Encode] → CONDITIONING negative
[Empty Hunyuan Latent] → LATENT (1280x720, 73 frames)
[KSampler] (30 steps, cfg 6.0, dpmpp_2m_sde, sgm_uniform)
[VAE Decode (Tiled)] → IMAGES
[VHS_VideoCombine] → MP4 (24fps)

Click Queue Prompt. On an RTX 4090 with Q5_K_M, expect roughly 12-15 min for this 73-frame (about 3-second) 720p clip.
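
The frame counts used in this guide (73, 97, 121, 129) all have the form 4n + 1. The sketch below assumes that pattern is a requirement of the latent node, which we infer from those numbers and from the temporally compressing VAE rather than from documentation, and converts a target duration into a matching frame count. Both helper names are hypothetical.

```python
FPS = 24
MAX_FRAMES = 129  # native limit quoted in this guide


def frames_for_duration(seconds: float, fps: int = FPS) -> int:
    """Nearest frame count of the form 4n + 1, capped at MAX_FRAMES."""
    n = max(0, round((seconds * fps - 1) / 4))
    return min(4 * n + 1, MAX_FRAMES)


def duration_for_frames(frames: int, fps: int = FPS) -> float:
    """Clip length in seconds for a given frame count."""
    return frames / fps


for target in (3, 4, 5):
    frames = frames_for_duration(target)
    print(f"{target} s -> {frames} frames ({duration_for_frames(frames):.2f} s)")
```

Enter the resulting number in the latent node's frame field. If your node accepts other counts, the 4n + 1 snapping is harmless but unnecessary.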


Image-to-Video Workflow {#i2v}

For HunyuanVideo-I2V:

[Load Image] → IMAGE (start frame, 720x1280 or 1280x720)
[Load Hunyuan I2V Model] → MODEL
[CLIP Vision Encode] → image conditioning
[HunyuanI2VLatent] → LATENT
[KSampler] (30 steps)
[VAE Decode]

The start image strongly drives composition; prompt drives motion and atmosphere. Use sharp, well-composed inputs at the same aspect ratio as your output.
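
To match the input image to the output aspect ratio, one option is a centered crop before resizing. The sketch below only computes the crop box with integer arithmetic. The helper name is hypothetical, and the returned (left, top, right, bottom) order happens to match the box expected by Pillow's `Image.crop`, should you use that library.

```python
def center_crop_box(src_w: int, src_h: int, out_w: int = 1280, out_h: int = 720):
    """Largest centered crop of the source with the out_w:out_h aspect ratio."""
    if src_w * out_h > src_h * out_w:
        # Source is wider than the target: trim the sides.
        crop_w, crop_h = src_h * out_w // out_h, src_h
    else:
        # Source is taller than (or equal to) the target: trim top and bottom.
        crop_w, crop_h = src_w, src_w * out_h // out_w
    left = (src_w - crop_w) // 2
    top = (src_h - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


# A 4:3 photo cropped for 1280x720 output loses 180 px at top and bottom.
print(center_crop_box(1920, 1440))
```

For portrait output, pass `out_w=720, out_h=1280`.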


Prompt Engineering for Cinematic Output {#prompts}

HunyuanVideo responds best to detailed cinematographic prompts:

Professional cinematic shot of [subject] [action]. [Camera move] from [angle].
[Lens type], [aperture]. [Lighting]. [Atmosphere]. [Color grading].
[Time of day]. [Style references].

Example:

Professional cinematic dolly-in of a samurai sheathing his katana, slow-motion, 35mm anamorphic lens, f/2.8, golden-hour rim lighting, atmospheric haze, teal-and-orange color grade, dawn, in the style of Akira Kurosawa cinematography.

Negative prompt:

blurry, deformed, jittery, watermark, text overlay, low quality, harsh artifacts, frame duplication, oversaturated

For LoRA-conditioned generation, include the LoRA trigger word in the prompt.
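
If you generate many shots, filling the template above by hand gets error-prone. This is a minimal sketch of a prompt builder that follows the bracketed template. The `Shot` fields and `build_prompt` are our own names, and placing the LoRA trigger word at the start of the prompt is an assumption, since the guide only says to include it.

```python
from dataclasses import dataclass

NEGATIVE_PROMPT = (
    "blurry, deformed, jittery, watermark, text overlay, low quality, "
    "harsh artifacts, frame duplication, oversaturated"
)


@dataclass
class Shot:
    subject: str
    action: str
    camera_move: str
    angle: str
    lens: str
    aperture: str
    lighting: str
    atmosphere: str
    color_grading: str
    time_of_day: str
    style: str


def build_prompt(shot: Shot, trigger_word: str = "") -> str:
    """Fill the cinematic template; optionally prepend a LoRA trigger word."""
    prompt = (
        f"Professional cinematic shot of {shot.subject} {shot.action}. "
        f"{shot.camera_move} from {shot.angle}. "
        f"{shot.lens}, {shot.aperture}. {shot.lighting}. {shot.atmosphere}. "
        f"{shot.color_grading}. {shot.time_of_day}. {shot.style}."
    )
    return f"{trigger_word}, {prompt}" if trigger_word else prompt


samurai = Shot(
    subject="a samurai", action="sheathing his katana",
    camera_move="Slow dolly-in", angle="a low angle",
    lens="35mm anamorphic lens", aperture="f/2.8",
    lighting="Golden-hour rim lighting", atmosphere="Atmospheric haze",
    color_grading="Teal-and-orange color grade", time_of_day="Dawn",
    style="In the style of Akira Kurosawa cinematography",
)
print(build_prompt(samurai))
```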


LoRA Training via Musubi Tuner {#lora-training}

git clone https://github.com/kohya-ss/musubi-tuner
cd musubi-tuner
pip install -r requirements.txt

Dataset format: 10-50 short video clips (3-5 seconds each) at 720p, paired with caption .txt files.

# Flag names can differ between Musubi Tuner versions; check the README in your checkout.
python hv_train_network.py \
    --dit /path/to/hunyuan-video-dit-checkpoint \
    --dit_dtype bf16 \
    --dataset_config /path/to/config.toml \
    --network_module networks.lora \
    --network_dim 32 --network_alpha 16 \
    --learning_rate 1e-4 \
    --output_dir ./lora_output \
    --output_name my_lora \
    --max_train_steps 4000

Time: ~6-12 hours on RTX 4090 for a 4000-step LoRA. Result: my_lora.safetensors loadable in ComfyUI workflows.

For character / style LoRAs: 20-30 reference clips covering varied angles, lighting, expressions.
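
Before starting a multi-hour training run, it is worth checking that the dataset matches the format described above: every clip paired with a caption `.txt` file, and a clip count in the suggested 10-50 range. The sketch below assumes clips and captions sit side by side in one folder with the same base name. The extension list and both helper names are our assumptions, not Musubi Tuner requirements.

```python
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm"}  # assumed; extend as needed


def clips_missing_captions(dataset_dir: Path) -> list[str]:
    """Names of video clips that have no same-named .txt caption file."""
    missing = []
    for clip in sorted(dataset_dir.iterdir()):
        is_video = clip.suffix.lower() in VIDEO_EXTENSIONS
        if is_video and not clip.with_suffix(".txt").is_file():
            missing.append(clip.name)
    return missing


def clip_count_in_range(dataset_dir: Path, low: int = 10, high: int = 50) -> bool:
    """True if the number of clips falls in the range suggested by this guide."""
    count = sum(
        1 for p in dataset_dir.iterdir() if p.suffix.lower() in VIDEO_EXTENSIONS
    )
    return low <= count <= high
```

Call `clips_missing_captions(Path("/path/to/dataset"))` and fix anything it lists before caching or training.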


Frame-Pack and Long-Clip Generation {#frame-pack}

HunyuanVideo natively supports up to 129 frames (~5.4 sec at 24 fps). For longer clips:

  • Frame-pack composition (built into wrapper): chain shorter generations with frame overlap for smoother joins.
  • HunyuanVideo-LongContext (community fine-tune): up to ~10-second native generation at the cost of higher VRAM.
  • Last-frame chaining: extract last frame, use as I2V input for next segment.

For narrative work, treat HunyuanVideo as a per-shot tool — generate 5-second shots covering your storyboard, edit together.
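
When chaining segments, each segment after the first contributes fewer new frames because some are shared with the previous one. The sketch below estimates how many generations a target duration needs. The 121-frame segment length and the one-frame overlap (the reused last frame) are assumptions for illustration; a frame-overlap approach would use a larger `overlap_frames`.

```python
import math


def segments_needed(
    target_seconds: float,
    fps: int = 24,
    segment_frames: int = 121,
    overlap_frames: int = 1,
) -> int:
    """Number of chained generations needed to cover target_seconds."""
    if not 0 <= overlap_frames < segment_frames:
        raise ValueError("overlap_frames must be in [0, segment_frames)")
    target_frames = math.ceil(target_seconds * fps)
    if target_frames <= segment_frames:
        return 1
    new_per_segment = segment_frames - overlap_frames
    return 1 + math.ceil((target_frames - segment_frames) / new_per_segment)


for seconds in (5, 15, 30):
    print(f"{seconds} s -> {segments_needed(seconds)} segment(s)")
```

Multiply the result by your per-segment render time to budget a long shot.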


FastVideo Distillation {#fastvideo}

FastVideo (hao-ai-lab/FastVideo) is a variant distilled from HunyuanVideo that runs 4-8x faster with slight quality loss:

huggingface-cli download FastVideo/FastHunyuan \
    --local-dir ComfyUI/models/diffusion_models

Render time on RTX 4090: ~3-5 min for 5-second 720p (vs 12-20 min base). Quality is ~90-95% of base Hunyuan; for iteration / drafting it's the right choice. For final renders use base.


Multi-GPU for Lower Latency {#multi-gpu}

For 2x RTX 4090 (48 GB total):

# In ComfyUI-MultiGPU node configuration
diffusion_model_device: cuda:0
text_encoder_device: cuda:1   # offload the text encoders (LLaVA-LLaMA 3 + CLIP-L) to the second GPU
vae_device: cuda:0

Result: BF16 fits across two GPUs without quantization. Render time ~10 min on 2x 4090 vs 20 min on single 4090 Q4. For heavy production work, dual-GPU is the practical target.


Frame Interpolation and Upscaling {#post-process}

24fps → 60fps via RIFE:

[VAE Decode] → IMAGES → [RIFE VFI] → IMAGES (60fps)

720p → 1440p / 4K via Real-ESRGAN x2 or Topaz Video AI:

[Upscale Image (using Model)] (Real-ESRGAN x2 or x4-UltraSharp)

Combined render+interp+upscale on RTX 4090: ~25-35 min for a 5-second 1440p 60fps clip.


Performance Benchmarks {#benchmarks}

5-second 720p clip, 30 steps, RTX 4090 Q5_K_M:

| Stage | Time |
| --- | --- |
| Generation | 13 min |
| RIFE 24→60fps | 1 min |
| Real-ESRGAN x2 | 4 min |
| Total | ~18 min |

For 1280x720 → 3840x2160 (4K) via Real-ESRGAN x4: add ~10 min.

For batch generation (10 shots overnight): set up workflow loop with seed iteration; plan ~3-4 hours per 10 shots at 720p.
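
The batch estimate above is just the per-clip stage times multiplied out. This sketch uses the stage timings from the benchmark table in this section; the helper names are ours, and real runs will vary with steps, resolution, and model load times.

```python
# Per-clip stage times in minutes, from the benchmark table (RTX 4090, Q5_K_M).
STAGE_MINUTES = {
    "generation": 13,
    "rife_24_to_60fps": 1,
    "real_esrgan_x2": 4,
}


def clip_minutes(stages: dict = STAGE_MINUTES) -> int:
    """Total minutes to produce one finished clip."""
    return sum(stages.values())


def batch_hours(shots: int, stages: dict = STAGE_MINUTES) -> float:
    """Total hours to render a batch of shots sequentially."""
    return shots * clip_minutes(stages) / 60


print(f"one clip: {clip_minutes()} min")
print(f"10 shots: {batch_hours(10):.1f} h")
```

Ten shots come to about 3 hours, which is the low end of the 3-4 hour range quoted above. The remainder is a reasonable allowance for model loading and slower prompts.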


Tuning Recipes {#tuning}

RTX 4090 (best quality consumer)

Q5_K_M GGUF + 30 steps + dpmpp_2m_sde + sgm_uniform + cfg 6.0.

RTX 5090 32 GB

Q8 GGUF or FP8 + 30 steps + slightly tighter sampler for sharper output.

Pro W7900 / A6000 48 GB

BF16 + 40 steps + extended frame count (97 frames = 4 sec at 24fps).

Mac Studio M4 Max 128 GB

BF16 + 30 steps. Slow (4-5x RTX 4090) but works for occasional renders.


Licensing {#licensing}

Both the original 13B HunyuanVideo and HunyuanVideo 1.5 ship under the Tencent Hunyuan Community License Agreement — NOT Apache 2.0 (several third-party write-ups incorrectly claimed 1.5 was Apache 2.0; the official LICENSE file is the Tencent Community license). Permissive for most commercial use, with explicit restrictions:

  • Territory: The license explicitly does NOT apply in the European Union, the United Kingdom, or South Korea.
  • Scale cap: Above 100 million monthly active users, you must request a separate license from Tencent (granted at its discretion).
  • Competitor training: Cannot use outputs to train competing AI models.
  • Disclosure / attribution: Must disclose you are the service provider and preserve Tencent attribution; the license is governed by Hong Kong law.

For deployment in the EU, UK, or South Korea, or for an unambiguous Apache 2.0 license, switch to Mochi (Apache 2.0) or CogVideoX. Always read the LICENSE in the official repo before commercial use.


Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
| --- | --- | --- |
| OOM at sampling | VRAM too tight | Lower frame count or use Q4 |
| OOM at VAE decode | Tiled VAE not enabled | Use Tiled VAE Decode |
| Black output | NaN in fp16 VAE | Use bf16 VAE explicitly |
| Workflow won't load | Wrapper version mismatch | Update wrapper + restart |
| Slow on 7900 XTX | FlashAttention build | Use ROCm FA-2 fork |
| Multi-GPU error | shm too small | Increase --shm-size to 16+ GB |
| Jittery motion | Too few steps | Increase to 35-40 |

FAQ {#faq}

See answers to common HunyuanVideo questions below.


Sources: HunyuanVideo 1.5 GitHub (Tencent) | HunyuanVideo 1.5 — ComfyUI native support | HunyuanVideo 1.5 model card (HuggingFace) | Original HunyuanVideo 13B GitHub | ComfyUI-HunyuanVideoWrapper | Musubi Tuner | city96 GGUF quants | Internal benchmarks RTX 4090, RTX 5090, Pro W7900.

Published: May 1, 2026 · Last updated: June 20, 2026 · Manually reviewed
