
What will happen if CLIP image representation is used to replace SSL representation? #34

tanbuzheng opened this issue May 30, 2024 · 3 comments


@tanbuzheng

Hi, author!
Thanks for sharing! You've done impressive work!
I have two questions.
The first is: what would happen if the CLIP image representation were used to replace the SSL representation in the first two stages?
The second is: why not also adopt a diffusion model in the third stage? Compared with diffusion models, what are the advantages of using MAGE?

Looking forward to your reply!

@LTH14
Owner

LTH14 commented May 30, 2024

Thanks for your interest! You can definitely use the CLIP image representation, or in general, any representation, to replace the MoCo v3 representation. In the paper, we mainly focus on the unconditional generation setting, where labels are not available. We therefore don't use CLIP in the paper, since it uses text data to train the encoder, but it is definitely possible.

The third stage can actually be any modern image generator. In Table 1 and Figure 2, we show that RCG significantly improves all of these generators, whether MAGE or diffusion models. One advantage of MAGE is that it achieves a much better unconditional generation performance on its own (compared with diffusion models). Therefore, when combined with RCG, MAGE achieves the best unconditional generation performance among all competitors.
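The representation swap described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not RCG's actual code: `extract_representation` is a hypothetical helper, the toy encoder stands in for a real frozen encoder (MoCo v3, CLIP's image tower, etc.), and the L2 normalization is one plausible way to put different encoders' features on a common scale — the exact preprocessing in RCG may differ.

```python
import torch
import torch.nn as nn

# Hypothetical plug-in point: any frozen image encoder that maps a batch of
# images to a (B, D) feature tensor can supply the representation for the
# first two stages. For CLIP, one option (not shown running here, since it
# downloads weights) is the open_clip package:
#
#   import open_clip
#   clip_model, _, preprocess = open_clip.create_model_and_transforms(
#       "ViT-B-32", pretrained="laion2b_s34b_b79k")
#   encoder = clip_model.encode_image

def extract_representation(encoder, images: torch.Tensor) -> torch.Tensor:
    """Run a frozen encoder and L2-normalize the features (one common choice;
    RCG's own normalization may differ)."""
    with torch.no_grad():
        rep = encoder(images)
    return nn.functional.normalize(rep, dim=-1)

# Toy stand-in encoder so the sketch runs without downloading any weights.
toy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
images = torch.randn(4, 3, 32, 32)
rep = extract_representation(toy_encoder, images)
print(rep.shape)  # torch.Size([4, 256]); each row has unit L2 norm
```

The representation diffusion model and the pixel generator only see the feature tensor, so as long as the new encoder's output dimension matches what those stages expect (or they are retrained for it), the encoder is interchangeable.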

@tanbuzheng
Author

Thanks for your reply!
I have limited computing resources, only 1-2 RTX 3090 GPUs. Is it feasible to train the diffusion model at 256x256 resolution?
If I just want to train MAGE on ImageNet-1k, how long will it take?

@LTH14
Owner

LTH14 commented May 31, 2024

The representation diffusion model can be trained on a few GPUs. However, MAGE and the image diffusion models (DiT, LDM, ADM) need much more -- you can refer to Table 11 in the appendix for the specific training times.
