Add HunyuanVideo #10106
Closed by #10136
How about LoRA support?

```
ValueError: Target modules {'img_attn_qkv', 'txt_attn_proj', 'txt_mod.linear', 'img_mod.linear', 'img_attn_proj', 'txt_attn_qkv', 'linear1', 'fc2', 'modulation.linear', 'fc1', 'linear2'} not found in the base model. Please check the target modules and try again.
```
@svjack LoRA loading support was added in #10254, and training support was added here: a-r-r-o-w/finetrainers#126
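For reference, loading a diffusers-format LoRA should look roughly like this (a minimal sketch; it assumes `pipe` is a `HunyuanVideoPipeline` loaded as shown elsewhere in this thread, and the repository id and weight name are placeholders, not real files):

```python
# Placeholder repo id and file name -- substitute a real diffusers-format HunyuanVideo LoRA.
pipe.load_lora_weights(
    "your-username/your-hunyuan-lora",
    weight_name="pytorch_lora_weights.safetensors",
)
```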
```
ValueError: Target modules {'img_attn_qkv', 'txt_attn_proj', 'txt_mod.linear', 'img_mod.linear', 'img_attn_proj', 'txt_attn_qkv', 'linear1', 'fc2', 'modulation.linear', 'fc1', 'linear2'} not found in the base model. Please check the target modules and try again.
```
It seems like this LoRA was not trained on the diffusers codebase, so the layer names are different than expected (they appear to come from the original Hunyuan codebase). Since I did see a few LoRAs trained with the original codebase, I'll add support for loading these soon. For now, only diffusers-format LoRAs are supported.
@a-r-r-o-w
@hlky How can I avoid this warning or error? Thanks.
@AlphaNext You can use a shorter prompt for CLIP with `prompt_2`.
@hlky Thanks for your reply. I read the official hunyuan_video code again; the text_encoder_2 (CLIP model) is not used in the official code, and the official code generates video without similar warnings while using the same prompt. So how could I disable the CLIP model in the diffusers version and only use text_encoder to encode the prompt and generate video? Also, what's the difference between `prompt` and `prompt_2`?
@AlphaNext The original uses the same prompt for both text encoders. All warnings can be hidden like this:

```python
from diffusers.utils.logging import set_verbosity_error as set_verbosity_error_diffusers
from transformers.utils.logging import set_verbosity_error as set_verbosity_error_transformers

set_verbosity_error_diffusers()
set_verbosity_error_transformers()
```

If we are generating multiple videos from the same prompt and use `pooled_prompt_embeds`, the prompt only needs to be encoded by CLIP once:

```python
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.float16)

pooled_prompt_embeds = pipe._get_clip_prompt_embeds("prompt " * 100)

pipe(
    ...,
    pooled_prompt_embeds=pooled_prompt_embeds,
)
```

We can also provide a shorter prompt to be used for CLIP with `prompt_2`:

```python
pipe(
    ...,
    prompt="prompt " * 100,
    prompt_2="prompt " * 5,
)
```
@hlky Thanks, it works.
@AlphaNext The first two settings are available in the flow match Euler sampler, while the third one comes down to an architectural note. The released checkpoint is CFG-distilled, so instead of using a real guidance scale, the model is trained on what each guidance scale looks like and takes it as an extra input, in the same way as Flux (dev, specifically). I'm not sure which is which in the original repo, but in diffusers the guidance scale is this trained "fake" guidance scale. I've also noticed it has little to no impact on the final result with HunyuanVideo. No idea what using the real guidance scale would look like here; it's worth looking into, though Flux dev tends to output worse results with it (along with taking twice as long).
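As a rough sketch of what the distilled guidance means in practice (assuming `pipe` is a loaded `HunyuanVideoPipeline` as above; the values below are illustrative, not official defaults):

```python
# Sketch: with a CFG-distilled checkpoint, guidance_scale is fed to the
# transformer as a conditioning input (like Flux dev), not used as a real
# classifier-free guidance weight, so there is no second unconditional pass.
video = pipe(
    prompt="A cat walks on the grass, realistic style.",
    num_inference_steps=30,
    guidance_scale=6.0,  # embedded/distilled guidance
).frames[0]
```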
HunyuanVideo
We present HunyuanVideo, a novel open-source video foundation model that exhibits performance in video generation that is comparable to, if not superior to, leading closed-source models. HunyuanVideo features a comprehensive framework that integrates several key contributions, including data curation, image-video joint model training, and an efficient infrastructure designed to facilitate large-scale model training and inference. Additionally, through an effective strategy for scaling model architecture and dataset, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models.
Code
Weights
🎥 Demo (HQ version)
demo.mp4
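For completeness, a minimal end-to-end inference sketch with the community weights referenced in this thread (a sketch only; prompt, frame count, and step count are illustrative assumptions, not official defaults):

```python
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()  # optional, reduces VRAM usage during decoding
pipe.to("cuda")

video = pipe(
    prompt="A cat walks on the grass, realistic style.",
    num_frames=61,
    num_inference_steps=30,
).frames[0]
export_to_video(video, "output.mp4", fps=15)
```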