
Add HunyuanVideo #10106

Closed
hlky opened this issue Dec 3, 2024 · 12 comments · Fixed by #10376

Comments

@hlky (Collaborator) commented Dec 3, 2024

HunyuanVideo

We present HunyuanVideo, a novel open-source video foundation model that exhibits performance in video generation that is comparable to, if not superior to, leading closed-source models. HunyuanVideo features a comprehensive framework that integrates several key contributions, including data curation, image-video joint model training, and an efficient infrastructure designed to facilitate large-scale model training and inference. Additionally, through an effective strategy for scaling model architecture and dataset, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models.

Code
Weights

🎥 Demo (HQ version)

demo.mp4

@hlky (Collaborator, Author) commented Dec 17, 2024

Closed by #10136

hlky closed this as completed Dec 17, 2024

@svjack commented Dec 21, 2024

Closed by #10136

How about LoRA support?
It seems the LoRA format supported by HunyuanVideoLoraLoaderMixin (including the text-encoder LoRA blocks) is not the same as the LoRA format that ComfyUI supports:

ValueError: Target modules {'img_attn_qkv', 'txt_attn_proj', 'txt_mod.linear', 'img_mod.linear', 'img_attn_proj', 'txt_attn_qkv', 'linear1', 'fc2', 'modulation.linear', 'fc1', 'linear2'} not found in the base model. Please check the target modules and try again.

@a-r-r-o-w (Member) commented

@svjack LoRA loading support was added in #10254, and training support was added here: a-r-r-o-w/finetrainers#126
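
For reference, a minimal sketch of loading a diffusers-format LoRA with the loader added in #10254; the LoRA repository id, weight filename, and adapter name below are hypothetical placeholders, not real artifacts:

import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)

# Load a LoRA whose keys are already in the diffusers format.
pipe.load_lora_weights(
    "some-user/some-hunyuanvideo-lora",  # hypothetical repository
    weight_name="pytorch_lora_weights.safetensors",  # hypothetical filename
    adapter_name="example",
)
pipe.set_adapters("example", 0.9)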

@svjack commented Dec 21, 2024

@svjack LoRA loading support was added in #10254, and training support was added here: a-r-r-o-w/finetrainers#126

import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

from enhance_a_video import enable_enhance, inject_feta_for_hunyuanvideo, set_enhance_weight

model_id = "tencent/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16, revision="refs/pr/18"
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, revision="refs/pr/18", torch_dtype=torch.bfloat16
)

#### from https://huggingface.co/svjack/Genshin_Impact_XiangLing_Low_Res_HunyuanVideo_lora_early
pipe.load_lora_weights("Genshin_Impact_XiangLing_Low_Res_HunyuanVideo_lora_early/xiangling_ep2_lora.safetensors")

The call above fails with:

ValueError: Target modules {'img_attn_qkv', 'txt_attn_proj', 'txt_mod.linear', 'img_mod.linear', 'img_attn_proj', 'txt_attn_qkv', 'linear1', 'fc2', 'modulation.linear', 'fc1', 'linear2'} not found in the base model. Please check the target modules and try again.

@a-r-r-o-w (Member) commented

It seems this LoRA was not trained with the diffusers codebase, so the layer names are different from what the loader expects (they appear to come from the original HunyuanVideo codebase). Since I've seen a few LoRAs trained with the original codebase, I'll add support for loading these soon. For now, only diffusers-format LoRAs are supported.
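
As a quick check (a sketch, not part of the original reply), one can inspect the LoRA's key names to tell the two formats apart before calling load_lora_weights; the assumption here is that diffusers-format HunyuanVideo LoRA keys are prefixed with "transformer.":

from safetensors.torch import load_file

state_dict = load_file("Genshin_Impact_XiangLing_Low_Res_HunyuanVideo_lora_early/xiangling_ep2_lora.safetensors")
print(list(state_dict.keys())[:5])

# Diffusers-format HunyuanVideo LoRAs typically use keys such as
# "transformer.transformer_blocks.0.attn.to_q.lora_A.weight", while LoRAs trained with the
# original codebase use module names like "img_attn_qkv" or "txt_mod.linear".
looks_like_diffusers = all(key.startswith("transformer.") for key in state_dict)
print("diffusers-format keys:", looks_like_diffusers)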

@junsukha commented

@a-r-r-o-w Is there an example or script showing how to train a LoRA for HunyuanVideo?

@AlphaNext commented

@hlky How can I avoid this warning/error? Thanks.

Token indices sequence length is longer than the specified maximum sequence length for this model (123 > 77). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens:

@hlky (Collaborator, Author) commented Dec 30, 2024

@AlphaNext You can use a shorter prompt for CLIP with prompt_2 or provide pre-generated pooled_prompt_embeds.

@AlphaNext commented

@hlky Thanks for your reply. I read the official hunyuan_video code again; text_encoder_2 (the CLIP model) does not seem to be used there, and the official code generates video from the same prompt without similar warnings. How can I disable the CLIP model in the diffusers version and use only text_encoder to encode the prompt and generate video?

Also, what's the difference between prompt and prompt_2, and how do I get pre-generated pooled_prompt_embeds?

@hlky (Collaborator, Author) commented Dec 30, 2024

@AlphaNext The original code uses the same text_encoder_2 and has the same token-length limitation; the input is silently truncated there.

All warnings can be hidden like this:

from diffusers.utils.logging import set_verbosity_error as set_verbosity_error_diffusers
from transformers.utils.logging import set_verbosity_error as set_verbosity_error_transformers

set_verbosity_error_diffusers()
set_verbosity_error_transformers()

If we are generating multiple videos from the same prompt, we can call _get_clip_prompt_embeds once and pass the result as pooled_prompt_embeds, so the warning appears only once; alternatively, use code based on _get_clip_prompt_embeds that skips the warning.

import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pooled_prompt_embeds = pipe._get_clip_prompt_embeds("prompt " * 100)
pipe(
    ...,
    pooled_prompt_embeds=pooled_prompt_embeds,
)

We can also provide a shorter prompt to be used for text_encoder_2.

pipe(
    ...,
    prompt="prompt " * 100,
    prompt_2="prompt " * 5,
)

@AlphaNext commented

@hlky Thanks, it works.
Some prediction parameters in the official code (such as flow_shift, flow_reverse, and embedded_cfg_scale) don't appear in the diffusers version. Do these parameters not have much effect on video generation?

@Ednaordinary commented

@AlphaNext The first two settings are available in the flow-match Euler scheduler, while the third comes down to an architectural note. The released checkpoint is CFG-distilled: instead of applying real classifier-free guidance, the model is trained with the guidance value as an input, in the same way as Flux (dev, specifically). I'm not sure which is which in the repo, but in diffusers, guidance_scale is this trained "fake" (embedded) guidance scale. I've also noticed it has little to no impact on the final result with HunyuanVideo. No idea what using real guidance would look like here; it's worth looking into, though with Flux dev it tends to produce worse results (and takes twice as long).
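
As a rough sketch (not from the original comment, assuming the hunyuanvideo-community/HunyuanVideo checkpoint): flow_shift roughly maps to the flow-match scheduler's shift, and embedded_cfg_scale maps to guidance_scale, which is fed to the transformer as the distilled guidance embedding rather than used for real CFG. Values below are illustrative.

import torch
from diffusers import FlowMatchEulerDiscreteScheduler, HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)

# flow_shift -> the scheduler's `shift` config value (7.0 here is illustrative).
pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config, shift=7.0)

video = pipe(
    prompt="a cat walks on the grass, realistic style",
    num_frames=61,
    num_inference_steps=30,
    guidance_scale=6.0,  # embedded (distilled) guidance, analogous to embedded_cfg_scale
).frames[0]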
