
[refactor] move positional embeddings to patch embed layer for CogVideoX #9263

Merged

a-r-r-o-w merged 12 commits into main from cogvideox/pipeline-followups on Sep 3, 2024

Conversation

a-r-r-o-w
Member

@a-r-r-o-w a-r-r-o-w commented Aug 24, 2024

What does this PR do?

  • removes the 49-frame limit, since CogVideoX-5B generalizes better than 2B and is able to generate more frames

  • moves the positional embedding creation logic to the pipeline, similar to rotary embeddings

  • moves the positional embedding logic to the patch embed layer

As a side effect of this PR, one can generate more than 49 frames with CogVideoX-2b, which will produce bad results, but we can add a recommendation about this.
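For illustration only, here is a toy sketch of the last point (hypothetical class and helper names, not the diffusers implementation, which uses a 3D sincos embedding over space and time): the patch embed layer builds its positional embedding from the runtime sequence length, so no fixed frame cap is baked in.

```python
import torch


def get_1d_sincos_pos_embed(embed_dim: int, num_positions: int) -> torch.Tensor:
    # Standard 1D sine-cosine table of shape (num_positions, embed_dim).
    positions = torch.arange(num_positions, dtype=torch.float32)
    omega = 1.0 / 10000 ** (torch.arange(embed_dim // 2, dtype=torch.float32) / (embed_dim // 2))
    angles = positions[:, None] * omega[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)


class PatchEmbedWithPosEmbed(torch.nn.Module):
    # Toy patch-embed layer that builds its positional embedding on the fly,
    # so the frame count is not capped by a pre-computed buffer.
    def __init__(self, in_channels: int, embed_dim: int, patch_size: int):
        super().__init__()
        self.proj = torch.nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.embed_dim = embed_dim

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, frames, channels, height, width)
        b, f, _, _, _ = latents.shape
        patches = self.proj(latents.flatten(0, 1))    # (b*f, d, h', w')
        patches = patches.flatten(2).transpose(1, 2)  # (b*f, h'*w', d)
        patches = patches.reshape(b, -1, self.embed_dim)
        # Sized to the actual sequence length; a fixed registered buffer here
        # is exactly where a hard frame limit would come from.
        pos_embed = get_1d_sincos_pos_embed(self.embed_dim, patches.shape[1])
        return patches + pos_embed.to(patches.dtype).to(patches.device)[None]
```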

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yiyixuxu

@a-r-r-o-w a-r-r-o-w changed the title [refactor] removes the frame limititation in CogVideoX [refactor] removes the frame limitation in CogVideoX Aug 24, 2024
@a-r-r-o-w a-r-r-o-w changed the title [refactor] removes the frame limitation in CogVideoX [refactor] remove the frame limitation in CogVideoX Aug 24, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w
Member Author

a-r-r-o-w commented Aug 24, 2024

cc @zRzRzRzRzRzRzR for visibility

The changes here should not affect the quality of final outputs in any way:

  • The results before and after this PR for CogVideoX-2B still match 1:1, as expected.
  • This PR also should not affect CogVideoX-5B.

I've verified these, but a second look by others would be appreciated.

Here are the results when pushing num_frames to the max (89 frames), beyond which results start to get much worse:

cogvideox-2b-89frames.webm
cogvideox-5b-89frames.webm
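For reproducibility, a minimal sketch of such a run with the standard CogVideoXPipeline API (the prompt is a made-up placeholder):

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.to("cuda")

# With the frame limit removed, num_frames > 49 is accepted; for the 2B
# model quality degrades noticeably by 89 frames, as shown above.
video = pipe(
    prompt="A panda playing guitar in a bamboo forest",  # hypothetical prompt
    num_frames=89,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "cogvideox-2b-89frames.mp4", fps=8)
```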

@a-r-r-o-w a-r-r-o-w requested a review from yiyixuxu August 24, 2024 01:32
@tin2tin

tin2tin commented Aug 24, 2024

Is the resolution hard-coded for this model (720x480)? Will everything else break?

@a-r-r-o-w
Member Author

> Is the resolution hard-coded for this model (720x480)? Will everything else break?

You can generate at other resolutions as well, just like any other model. The recommendation is 720x480 because that's the resolution of most of the training dataset, I believe. I haven't experimented extensively yet, but I've gotten good results at 512x512, 512x768, 768x768, and 640x640 with the 5B model. I have not tried different resolutions on the 2B model, but if you face any issues, we could work together on fixing them.
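A minimal sketch of that, reusing the `pipe` from the earlier snippet (the prompt and resolution are just illustrative):

```python
# height/width are free parameters on the pipeline call, though quality
# is best near the 720x480 training resolution.
video = pipe(
    prompt="A rollercoaster car launched from a ramp",  # hypothetical prompt
    height=512,
    width=512,
    num_frames=49,
).frames[0]
export_to_video(video, "cogvideox-2b-512x512.mp4", fps=8)
```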

@tin2tin

tin2tin commented Aug 25, 2024

@a-r-r-o-w I only have access to the 2B model. I tried 704x480 and the video turned out as staircased lines, so I assumed the resolution was hard-coded.
In this patch, the 720x480 values appear, but I don't know what the code does: 960c149#diff-b757c92f175a14386152bfcdfc1e67cff57b6bc67808c32769792a22c8f8fdf5R455

I'll try the resolutions you're suggesting. Is the 5B model available anywhere yet?

Sorry for sidetracking this patch.

@a-r-r-o-w
Member Author

The 5B model is yet to be officially released; it will be out in a day or two. The 2B model does not use rotary positional embeddings (it uses basic sincos positional embeddings), so the hardcoded values you see are irrelevant there.

For the rotary embeds, the embedding grid needs to be appropriately resized/rescaled based on user input. This is typically done using a fixed base latent resolution, usually chosen to be the resolution most frequent in the training dataset. In this case, that happens to be 720 // (8 * 2) x 480 // (8 * 2), i.e. a 45 x 30 latent grid. You can see another example of this here for HunyuanDiT.
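A hedged sketch of that idea, assuming a helper along the lines of the `get_resize_crop_region_for_grid` function used for HunyuanDiT (the exact behavior here is an approximation, not a verbatim copy):

```python
def get_resize_crop_region_for_grid(src_h: int, src_w: int, base_h: int, base_w: int):
    # Fit the target latent grid inside the fixed base grid while keeping
    # its aspect ratio, then center it. Rotary frequencies are computed
    # over this region, so arbitrary user resolutions map onto the grid
    # the model was trained on.
    if src_h / src_w > base_h / base_w:
        resize_h, resize_w = base_h, round(base_h / src_h * src_w)
    else:
        resize_h, resize_w = round(base_w / src_w * src_h), base_w
    top = (base_h - resize_h) // 2
    left = (base_w - resize_w) // 2
    return (top, left), (top + resize_h, left + resize_w)


# CogVideoX: the VAE downscales by 8 spatially and the transformer uses
# patch size 2, so the base 720x480 resolution gives a 45x30 latent grid.
base_h, base_w = 480 // (8 * 2), 720 // (8 * 2)  # 30, 45
region = get_resize_crop_region_for_grid(512 // 16, 512 // 16, base_h, base_w)
print(region)  # ((0, 7), (30, 37)) for a 512x512 request
```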

@tin2tin

tin2tin commented Aug 25, 2024

@a-r-r-o-w Btw. I'm getting this message when running 2b via Diffusers:
The config attributes {'mid_block_add_attention': True, 'sample_size': 256} were passed to AutoencoderKLCogVideoX, but are not expected and will be ignored. Please verify your config.json configuration file.

@tin2tin

tin2tin commented Aug 25, 2024

@a-r-r-o-w This is 2b doing 512x512:

-1872826932_A_rollercoaster_car_is_launched_from_a.mp4

@a-r-r-o-w
Member Author

Ah yes, these parameters were removed from the implementation since they were not relevant/correct, so it's a harmless warning. I'll open a PR to the 2B repo in a bit to remove these from the VAE config.json.

@a-r-r-o-w
Member Author

a-r-r-o-w commented Aug 25, 2024

> @a-r-r-o-w This is 2b doing 512x512:

Keep in mind that normal positional encoding, as used in 2B, is not great at generalizing to different resolutions when the training data does not include an ample amount of multi-resolution examples. RoPE, as used in 5B, can somewhat deal with these issues with fewer examples; however, further multi-resolution finetuning of Cog-2B/5B is required to properly address these concerns (something the community will hopefully pick up on fairly soon). Here's a 5B 512x512 result:

cog-5b-512.webm

@tin2tin

tin2tin commented Aug 25, 2024

@a-r-r-o-w PM me if you need someone to test Diffusers 5B locally on an RTX 4090, on Windows, before the release.

Collaborator

@yiyixuxu yiyixuxu left a comment


thanks!
I'm aware there is some discrepancy between the rotary and sincos embeddings, but moving it to the pipeline will create more discrepancy with the rest of the codebase, so I think we can either leave it as is for now, or maybe consider adding a positional embedding in the transformer for the rotary embedding too (and we can apply the same pattern across all relevant pipelines in a follow-up PR).
Let me know what you think!

@a-r-r-o-w a-r-r-o-w requested a review from yiyixuxu August 27, 2024 11:16
Collaborator

@yiyixuxu yiyixuxu left a comment


I left a comment!

@a-r-r-o-w a-r-r-o-w requested a review from yiyixuxu September 2, 2024 10:23
@a-r-r-o-w a-r-r-o-w changed the title [refactor] remove the frame limitation in CogVideoX [refactor] move positional embeddings to patch embed layer for CogVideoX Sep 2, 2024
Collaborator

@yiyixuxu yiyixuxu left a comment


thanks for the PR!

@a-r-r-o-w
Member Author

cc @zRzRzRzRzRzRzR for visibility. This should only impact CogVideoX-2b in terms of implementation, but has no effect on final outputs.

@a-r-r-o-w a-r-r-o-w merged commit 9d49b45 into main Sep 3, 2024
18 checks passed
@a-r-r-o-w a-r-r-o-w deleted the cogvideox/pipeline-followups branch September 3, 2024 09:15
a-r-r-o-w added a commit that referenced this pull request Sep 17, 2024
…eoX (#9263)

* remove frame limit in cogvideox

* remove debug prints

* Update src/diffusers/models/transformers/cogvideox_transformer_3d.py

* revert pipeline; remove frame limitation

* revert transformer changes

* address review comments

* add error message

* apply suggestions from review
sayakpaul pushed a commit that referenced this pull request Dec 23, 2024
…eoX (#9263)
