[refactor] move positional embeddings to patch embed layer for CogVideoX #9263
Conversation
cc @zRzRzRzRzRzRzR for visibility. The changes here should not affect the quality of final outputs in any way.
I've verified these, but a second look by others would be appreciated. Pushing num_frames to the max (89 frames, after which results start to get much worse), here are the results:
Is the resolution hard-coded for this model (720x480), and will everything else break?
You can generate at other resolutions as well, just like any other model. The recommendation is 720x480 because that's the resolution of most of the training dataset, I believe. I haven't experimented extensively yet, but I've gotten good results at 512x512, 512x768, 768x768 and 640x640 with the 5B model. I have not tried different resolutions on the 2B model, but if you face any issues, we could work together on fixing it.
@a-r-r-o-w I only have access to the 2B model. I tried 704x480 and the video came out as staircase artifacts, so I assumed the resolution was hard-coded. I'll try the resolutions you're suggesting. Is the 5B model available anywhere? Sorry for sidetracking this patch.
The 5B model is yet to be officially released. It will be out in a day or two. The 2B model does not use rotary positional embeddings and uses basic positional embeddings, so the hardcoded values you see are irrelevant there. For the rotary embeds, the embedding grid needs to be appropriately resized/rescaled based on the user input. This is generally done using a fixed base latent resolution (usually chosen to be the resolution that is most frequent in the dataset). In this case, that happens to be 720x480.
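The resize/rescale step described above can be sketched roughly as: fit the requested latent grid onto the base grid while preserving aspect ratio, then use the resulting center-crop region to sample the rotary frequencies. A hedged sketch with illustrative names (`resize_crop_region_for_grid` and its arguments are assumptions here, not the actual diffusers API):

```python
def resize_crop_region_for_grid(grid_hw, base_hw):
    """Fit the user's latent grid onto the base grid (e.g. the 720x480
    training resolution) preserving aspect ratio, and return the crop
    region from which rotary embedding frequencies would be sampled."""
    h, w = grid_hw
    base_h, base_w = base_hw
    if h / w > base_h / base_w:
        # Target is taller than the base: match height, shrink width.
        resize_h = base_h
        resize_w = round(base_h * w / h)
    else:
        # Target is wider (or same ratio): match width, shrink height.
        resize_w = base_w
        resize_h = round(base_w * h / w)
    top = round((base_h - resize_h) / 2)
    left = round((base_w - resize_w) / 2)
    return (top, left), (top + resize_h, left + resize_w)

# 720x480 with a total spatial downscale of 16 gives a 30x45 latent grid,
# which maps onto the 30x45 base grid with no cropping at all:
full = resize_crop_region_for_grid((30, 45), (30, 45))
# A square 512x512 request (32x32 grid) center-crops the base grid's width:
square = resize_crop_region_for_grid((32, 32), (30, 45))
```

The grid numbers above assume a 16x spatial downscale between pixels and latent patches; the real pipeline derives them from the VAE scale factor and patch size.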
@a-r-r-o-w Btw. I'm getting this message when running 2b via Diffusers: |
@a-r-r-o-w This is 2b doing 512x512: -1872826932_A_rollercoaster_car_is_launched_from_a.mp4 |
Ah yes, these parameters were removed from the implementation since they were not relevant/correct. It's a harmless warning. I'll open a PR to the 2B repo in a bit to remove these from the VAE config.json |
Keep in mind that normal positional encoding, as used in 2B, is not great at generalizing to different resolutions when the training data does not include an ample amount of multi-resolution examples. RoPE, as used in 5B, can somewhat deal with these issues with fewer examples; however, further multi-resolution finetuning is required in Cog-2B/5B to properly address these concerns (something that the community might pick up on fairly soon, hopefully). Here's a 5B 512x512 result: cog-5b-512.webm
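One reason fixed sincos embeddings generalize poorly is that the table is generated for one specific grid, so a different resolution produces a different token count and needs a regenerated table with positions the model never saw in training. A minimal NumPy sketch, assuming a standard 2D sincos scheme (half the channels encode the row index, half the column index):

```python
import numpy as np

def sincos_embed_1d(dim, positions):
    # Classic transformer sinusoid: half sin, half cos per position.
    omega = 1.0 / 10000 ** (np.arange(dim // 2) / (dim // 2))
    angles = positions[:, None] * omega[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_embed_2d(dim, grid_h, grid_w):
    # Half the channels encode the row index, half the column index.
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    emb_h = sincos_embed_1d(dim // 2, ys.reshape(-1))
    emb_w = sincos_embed_1d(dim // 2, xs.reshape(-1))
    return np.concatenate([emb_h, emb_w], axis=1)  # (grid_h * grid_w, dim)

# A table built for one grid does not fit another resolution's token count:
base = sincos_embed_2d(16, 30, 45)   # e.g. a 480x720 latent grid
other = sincos_embed_2d(16, 32, 32)  # e.g. 512x512 -- must be regenerated
```

RoPE sidesteps part of this because it encodes relative offsets in the attention computation rather than adding absolute position vectors to the tokens.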
@a-r-r-o-w PM me, if you need someone to test Diffusers 5b locally on a RTX 4090 - on Windows - before the release. |
thanks!
I'm aware there is some discrepancy between the rotary and sincos embeddings, but moving it to the pipeline will create more discrepancy with the rest of the code base. So I think we can either leave it as it is for now, or maybe consider adding a positional embedding in the transformer for rotary embedding too (and we can apply the same pattern across all relevant pipelines in a follow-up PR).
let me know what you think!
I left a comment!
thanks for the PR!
cc @zRzRzRzRzRzRzR for visibility. Should only impact CogVideoX-2b in terms of implementation, but has no effect on final outputs.
…eoX (#9263)
* remove frame limit in cogvideox
* remove debug prints
* Update src/diffusers/models/transformers/cogvideox_transformer_3d.py
* revert pipeline; remove frame limitation
* revert transformer changes
* address review comments
* add error message
* apply suggestions from review
What does this PR do?
- removes the 49-frame limit, since CogVideoX-5B generalizes better than 2B and is able to generate more frames
- moves the positional embedding creation logic to the pipeline, similar to rotary embeddings
- moves the positional embedding logic to the patch embed layer
As a side effect of this PR, one can generate > 49 frames with CogVideoX-2b, which will produce bad results, but we can add a recommendation about this.
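The patch-embed-layer change can be illustrated with a minimal PyTorch sketch (hypothetical class and helper names, not the actual diffusers implementation): the layer patchifies each frame and builds a sinusoidal positional embedding sized to the actual token count at forward time, so no fixed 49-frame table caps the sequence length.

```python
import torch
import torch.nn as nn

class PatchEmbedWithPosEmbed(nn.Module):
    """Hypothetical sketch: patchify + positional embedding in one layer.
    The embedding is built from the actual input size at forward time,
    so no fixed frame/resolution table is baked into the module."""

    def __init__(self, in_channels: int, embed_dim: int, patch_size: int):
        super().__init__()
        self.proj = nn.Conv2d(
            in_channels, embed_dim, kernel_size=patch_size, stride=patch_size
        )

    @staticmethod
    def _sincos(num_tokens: int, dim: int, device) -> torch.Tensor:
        # Simple sinusoid over the flattened (frame, row, col) index.
        pos = torch.arange(num_tokens, device=device, dtype=torch.float32)
        omega = 1.0 / 10000 ** (torch.arange(dim // 2, device=device) / (dim // 2))
        angles = pos[:, None] * omega[None, :]
        return torch.cat([angles.sin(), angles.cos()], dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        x = self.proj(x.reshape(b * f, c, h, w))   # (b*f, D, gh, gw)
        d, gh, gw = x.shape[1:]
        x = x.flatten(2).transpose(1, 2).reshape(b, f * gh * gw, d)
        # Sized to match this input, not a fixed 49-frame grid.
        return x + self._sincos(f * gh * gw, d, x.device)

# Any frame count works -- the embedding grows with the input:
layer = PatchEmbedWithPosEmbed(in_channels=4, embed_dim=8, patch_size=2)
tokens = layer(torch.randn(2, 3, 4, 8, 8))  # 2 videos, 3 frames each
```

The quality caveat stands regardless of the mechanism: 2B was trained at 49 frames, so longer sequences merely become representable, not necessarily good.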
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@yiyixuxu