
Add HunyuanVideo #10106

Closed
hlky opened this issue Dec 3, 2024 · 12 comments · Fixed by #10376

Comments

@hlky (Collaborator) commented Dec 3, 2024

HunyuanVideo

We present HunyuanVideo, a novel open-source video foundation model that exhibits performance in video generation that is comparable to, if not superior to, leading closed-source models. HunyuanVideo features a comprehensive framework that integrates several key contributions, including data curation, image-video joint model training, and an efficient infrastructure designed to facilitate large-scale model training and inference. Additionally, through an effective strategy for scaling model architecture and dataset, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models.

Code
Weights

🎥 Demo (HQ version)

demo.mp4

@hlky (Collaborator, Author) commented Dec 17, 2024

Closed by #10136

hlky closed this as completed Dec 17, 2024

@svjack commented Dec 21, 2024

Closed by #10136

How about LoRA support?
It seems the LoRA format supported by HunyuanVideoLoraLoaderMixin (including the text-encoder LoRA blocks) is not the same as the LoRA format that ComfyUI supports:

ValueError: Target modules {'img_attn_qkv', 'txt_attn_proj', 'txt_mod.linear', 'img_mod.linear', 'img_attn_proj', 'txt_attn_qkv', 'linear1', 'fc2', 'modulation.linear', 'fc1', 'linear2'} not found in the base model. Please check the target modules and try again.

@a-r-r-o-w (Member) commented

@svjack LoRA loading support was added in #10254, and training support was added here: a-r-r-o-w/finetrainers#126
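
For reference, a minimal sketch of loading a diffusers-format LoRA with the loader added in #10254; the LoRA repository id, weight filename, and adapter name below are hypothetical placeholders, not real artifacts:

import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)

# Load a LoRA whose keys are already in the diffusers format.
pipe.load_lora_weights(
    "some-user/some-hunyuanvideo-lora",  # hypothetical repository
    weight_name="pytorch_lora_weights.safetensors",  # hypothetical filename
    adapter_name="example",
)
pipe.set_adapters("example", 0.9)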

@svjack commented Dec 21, 2024

@svjack LoRA loading support was added in #10254, and training support was added here: a-r-r-o-w/finetrainers#126

import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

from enhance_a_video import enable_enhance, inject_feta_for_hunyuanvideo, set_enhance_weight

model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16, revision="refs/pr/18"
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, revision="refs/pr/18", torch_dtype=torch.bfloat16
)

#### from https://huggingface.co/svjack/Genshin_Impact_XiangLing_Low_Res_HunyuanVideo_lora_early
pipe.load_lora_weights("Genshin_Impact_XiangLing_Low_Res_HunyuanVideo_lora_early/xiangling_ep2_lora.safetensors")

The call above fails with:

ValueError: Target modules {'img_attn_qkv', 'txt_attn_proj', 'txt_mod.linear', 'img_mod.linear', 'img_attn_proj', 'txt_attn_qkv', 'linear1', 'fc2', 'modulation.linear', 'fc1', 'linear2'} not found in the base model. Please check the target modules and try again.

@a-r-r-o-w (Member) commented

It seems this LoRA was not trained with the diffusers codebase, so the layer names are different from what the loader expects (they appear to come from the original HunyuanVideo codebase). Since I've seen a few LoRAs trained with the original codebase, I'll add support for loading these soon. For now, only diffusers-format LoRAs are supported.
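
As a quick check (a sketch, not part of the original reply), one can inspect the LoRA's key names to tell the two formats apart before calling load_lora_weights; the assumption here is that diffusers-format HunyuanVideo LoRA keys are prefixed with "transformer.":

from safetensors.torch import load_file

state_dict = load_file("Genshin_Impact_XiangLing_Low_Res_HunyuanVideo_lora_early/xiangling_ep2_lora.safetensors")
print(list(state_dict.keys())[:5])

# Diffusers-format HunyuanVideo LoRAs typically use keys such as
# "transformer.transformer_blocks.0.attn.to_q.lora_A.weight", while LoRAs trained with the
# original codebase use module names like "img_attn_qkv" or "txt_mod.linear".
looks_like_diffusers = all(key.startswith("transformer.") for key in state_dict)
print("diffusers-format keys:", looks_like_diffusers)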

@junsukha commented

@a-r-r-o-w Is there an example or script showing how to train a LoRA for HunyuanVideo?

@AlphaNext commented

@hlky How can I avoid this warning/error? Thanks.

Token indices sequence length is longer than the specified maximum sequence length for this model (123 > 77). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens:

@hlky (Collaborator, Author) commented Dec 30, 2024

@AlphaNext You can use a shorter prompt for CLIP with prompt_2 or provide pre-generated pooled_prompt_embeds.

@AlphaNext commented

@hlky Thanks for your reply. I read the official hunyuan_video code again; text_encoder_2 (the CLIP model) does not seem to be used there, and the official code generates video from the same prompt without similar warnings. How can I disable the CLIP model in the diffusers version and use only text_encoder to encode the prompt and generate video?

Also, what's the difference between prompt and prompt_2, and how do I get pre-generated pooled_prompt_embeds?

@hlky (Collaborator, Author) commented Dec 30, 2024

@AlphaNext The original code uses the same text_encoder_2 and has the same token-length limitation; the input is silently truncated there.

All warnings can be hidden like this:

from diffusers.utils.logging import set_verbosity_error as set_verbosity_error_diffusers
from transformers.utils.logging import set_verbosity_error as set_verbosity_error_transformers

set_verbosity_error_diffusers()
set_verbosity_error_transformers()

If we are generating multiple videos from the same prompt, we can call _get_clip_prompt_embeds once and pass the result as pooled_prompt_embeds, so the warning appears only once; alternatively, use code based on _get_clip_prompt_embeds that skips the warning.

import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pooled_prompt_embeds = pipe._get_clip_prompt_embeds("prompt " * 100)
pipe(
    ...,
    pooled_prompt_embeds=pooled_prompt_embeds,
)

We can also provide a shorter prompt to be used for text_encoder_2.

pipe(
    ...,
    prompt="prompt " * 100,
    prompt_2="prompt " * 5,
)

@AlphaNext commented

@hlky Thanks, it works.
Some prediction parameters in the official code (such as flow_shift, flow_reverse, and embedded_cfg_scale) don't appear in the diffusers version. Do these parameters not have much effect on video generation?

@Ednaordinary commented

@AlphaNext The first two settings are available in the flow-match Euler scheduler, while the third comes down to an architectural note. The released checkpoint is CFG-distilled: instead of applying real classifier-free guidance, the model is trained with the guidance value as an input, in the same way as Flux (dev, specifically). I'm not sure which is which in the repo, but in diffusers, guidance_scale is this trained "fake" (embedded) guidance scale. I've also noticed it has little to no impact on the final result with HunyuanVideo. No idea what using real guidance would look like here; it's worth looking into, though with Flux dev it tends to produce worse results (and takes twice as long).
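
As a rough sketch (not from the original comment, assuming the hunyuanvideo-community/HunyuanVideo checkpoint): flow_shift roughly maps to the flow-match scheduler's shift, and embedded_cfg_scale maps to guidance_scale, which is fed to the transformer as the distilled guidance embedding rather than used for real CFG. Values below are illustrative.

import torch
from diffusers import FlowMatchEulerDiscreteScheduler, HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)

# flow_shift -> the scheduler's `shift` config value (7.0 here is illustrative).
pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config, shift=7.0)

video = pipe(
    prompt="a cat walks on the grass, realistic style",
    num_frames=61,
    num_inference_steps=30,
    guidance_scale=6.0,  # embedded (distilled) guidance, analogous to embedded_cfg_scale
).frames[0]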
