[refactor] move positional embeddings to patch embed layer for CogVideoX #9263
Conversation
cc @zRzRzRzRzRzRzR for visibility. The changes here should not affect the quality of final outputs in any way.
I've verified these, but a second look by others would be appreciated. Pushing num_frames to the max (89 frames, after which results start to get much worse), here are the results:
Is the resolution hard-coded for this model (720x480), and will everything else break?
You can generate at other resolutions as well, just like any other model. The recommendation is 720x480 because that's the resolution of most of the training dataset, I believe. I haven't experimented extensively yet, but I've gotten good results at 512x512, 512x768, 768x768 and 640x640 with the 5B model. I have not tried different resolutions on the 2B model, but if you face any issues, we could work together on fixing it.
@a-r-r-o-w I only have access to the 2B model. I tried 704x480 and the video came out as staircase artifacts, so I assumed the resolution was hard-coded. I'll try the resolutions you're suggesting. Is the 5B model available anywhere? Sorry for sidetracking this patch.
The 5B model is yet to be officially released. It will be out in a day or two. The 2B model does not use rotary positional embeddings and uses basic positional embeddings, so the hardcoded values you see are irrelevant there. For the rotary embeds, the embedding grid needs to be appropriately resized/rescaled based on the user input. This is generally done using a fixed base latent resolution (usually chosen to be the resolution that is most frequent in the dataset). In this case, that happens to be 720x480.
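The resize/rescale step described above can be sketched roughly as: fit the requested latent grid onto the base grid while preserving aspect ratio, then use the resulting center-crop region to sample the rotary frequencies. A hedged sketch with illustrative names (`resize_crop_region_for_grid` and its arguments are assumptions here, not the actual diffusers API):

```python
def resize_crop_region_for_grid(grid_hw, base_hw):
    """Fit the user's latent grid onto the base grid (e.g. the 720x480
    training resolution) preserving aspect ratio, and return the crop
    region from which rotary embedding frequencies would be sampled."""
    h, w = grid_hw
    base_h, base_w = base_hw
    if h / w > base_h / base_w:
        # Target is taller than the base: match height, shrink width.
        resize_h = base_h
        resize_w = round(base_h * w / h)
    else:
        # Target is wider (or same ratio): match width, shrink height.
        resize_w = base_w
        resize_h = round(base_w * h / w)
    top = round((base_h - resize_h) / 2)
    left = round((base_w - resize_w) / 2)
    return (top, left), (top + resize_h, left + resize_w)

# 720x480 with a total spatial downscale of 16 gives a 30x45 latent grid,
# which maps onto the 30x45 base grid with no cropping at all:
full = resize_crop_region_for_grid((30, 45), (30, 45))
# A square 512x512 request (32x32 grid) center-crops the base grid's width:
square = resize_crop_region_for_grid((32, 32), (30, 45))
```

The grid numbers above assume a 16x spatial downscale between pixels and latent patches; the real pipeline derives them from the VAE scale factor and patch size.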
@a-r-r-o-w Btw. I'm getting this message when running 2b via Diffusers: |
@a-r-r-o-w This is 2b doing 512x512: -1872826932_A_rollercoaster_car_is_launched_from_a.mp4 |
Ah yes, these parameters were removed from the implementation since they were not relevant/correct. It's a harmless warning. I'll open a PR to the 2B repo in a bit to remove these from the VAE config.json |
Keep in mind that normal positional encoding, as used in 2B, is not great at generalizing to different resolutions when the training data does not include an ample amount of multi-resolution examples. RoPE, as used in 5B, can somewhat deal with these issues with fewer examples; however, further multi-resolution finetuning is required in Cog-2B/5B to properly address these concerns (something that the community might pick up on fairly soon, hopefully). Here's a 5B 512x512 result: cog-5b-512.webm
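One reason fixed sincos embeddings generalize poorly is that the table is generated for one specific grid, so a different resolution produces a different token count and needs a regenerated table with positions the model never saw in training. A minimal NumPy sketch, assuming a standard 2D sincos scheme (half the channels encode the row index, half the column index):

```python
import numpy as np

def sincos_embed_1d(dim, positions):
    # Classic transformer sinusoid: half sin, half cos per position.
    omega = 1.0 / 10000 ** (np.arange(dim // 2) / (dim // 2))
    angles = positions[:, None] * omega[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_embed_2d(dim, grid_h, grid_w):
    # Half the channels encode the row index, half the column index.
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    emb_h = sincos_embed_1d(dim // 2, ys.reshape(-1))
    emb_w = sincos_embed_1d(dim // 2, xs.reshape(-1))
    return np.concatenate([emb_h, emb_w], axis=1)  # (grid_h * grid_w, dim)

# A table built for one grid does not fit another resolution's token count:
base = sincos_embed_2d(16, 30, 45)   # e.g. a 480x720 latent grid
other = sincos_embed_2d(16, 32, 32)  # e.g. 512x512 -- must be regenerated
```

RoPE sidesteps part of this because it encodes relative offsets in the attention computation rather than adding absolute position vectors to the tokens.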
@a-r-r-o-w PM me, if you need someone to test Diffusers 5b locally on a RTX 4090 - on Windows - before the release. |
thanks!
I'm aware there is some discrepancy between the rotary and sincos embeddings, but moving it to the pipeline will create more discrepancy with the rest of the code base. So I think we can either leave it as it is for now, or maybe consider adding a positional embedding in the transformer for rotary embedding too (and we can apply the same pattern across all relevant pipelines in a follow-up PR).
let me know what you think!
I left a comment!
thanks for the PR!
cc @zRzRzRzRzRzRzR for visibility. Should only impact CogVideoX-2b in terms of implementation, but has no effect on final outputs.
…eoX (#9263)
* remove frame limit in cogvideox
* remove debug prints
* Update src/diffusers/models/transformers/cogvideox_transformer_3d.py
* revert pipeline; remove frame limitation
* revert transformer changes
* address review comments
* add error message
* apply suggestions from review
What does this PR do?
- removes the 49-frame limit, since CogVideoX-5B generalizes better than 2B and is able to generate more frames
- moves the positional embedding creation logic to the pipeline, similar to rotary embeddings
- moves the positional embedding logic to the patch embed layer
As a side effect of this PR, one can generate > 49 frames with CogVideoX-2b, which will produce bad results, but we can add a recommendation about this.
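The patch-embed-layer change can be illustrated with a minimal PyTorch sketch (hypothetical class and helper names, not the actual diffusers implementation): the layer patchifies each frame and builds a sinusoidal positional embedding sized to the actual token count at forward time, so no fixed 49-frame table caps the sequence length.

```python
import torch
import torch.nn as nn

class PatchEmbedWithPosEmbed(nn.Module):
    """Hypothetical sketch: patchify + positional embedding in one layer.
    The embedding is built from the actual input size at forward time,
    so no fixed frame/resolution table is baked into the module."""

    def __init__(self, in_channels: int, embed_dim: int, patch_size: int):
        super().__init__()
        self.proj = nn.Conv2d(
            in_channels, embed_dim, kernel_size=patch_size, stride=patch_size
        )

    @staticmethod
    def _sincos(num_tokens: int, dim: int, device) -> torch.Tensor:
        # Simple sinusoid over the flattened (frame, row, col) index.
        pos = torch.arange(num_tokens, device=device, dtype=torch.float32)
        omega = 1.0 / 10000 ** (torch.arange(dim // 2, device=device) / (dim // 2))
        angles = pos[:, None] * omega[None, :]
        return torch.cat([angles.sin(), angles.cos()], dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        x = self.proj(x.reshape(b * f, c, h, w))   # (b*f, D, gh, gw)
        d, gh, gw = x.shape[1:]
        x = x.flatten(2).transpose(1, 2).reshape(b, f * gh * gw, d)
        # Sized to match this input, not a fixed 49-frame grid.
        return x + self._sincos(f * gh * gw, d, x.device)

# Any frame count works -- the embedding grows with the input:
layer = PatchEmbedWithPosEmbed(in_channels=4, embed_dim=8, patch_size=2)
tokens = layer(torch.randn(2, 3, 4, 8, 8))  # 2 videos, 3 frames each
```

The quality caveat stands regardless of the mechanism: 2B was trained at 49 frames, so longer sequences merely become representable, not necessarily good.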
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@yiyixuxu