
What will happen if CLIP image representation is used to replace SSL representation? #34

tanbuzheng opened this issue May 30, 2024 · 3 comments


@tanbuzheng

Hi, author!
Thanks for sharing! You've done impressive work!
I have two questions.
The first is: what would happen if the CLIP image representation were used to replace the SSL representation in the first two stages?
The second is: why not also adopt a diffusion model in the third stage? Compared with diffusion models, what are the advantages of using MAGE?

Looking forward to your reply!

@LTH14
Owner

LTH14 commented May 30, 2024

Thanks for your interest! You can definitely use the CLIP image representation, or in general, any representation, to replace the MoCo v3 representation. In the paper, we mainly focus on the unconditional generation setting, where labels are not available. We therefore don't use CLIP in the paper, since it uses text data to train the encoder, but it is definitely possible.

The third stage can actually be any modern image generator. In Table 1 and Figure 2, we show that RCG significantly improves all of these generators, whether MAGE or diffusion models. One advantage of MAGE is that it achieves a much better unconditional generation performance on its own (compared with diffusion models). Therefore, when combined with RCG, MAGE achieves the best unconditional generation performance among all competitors.
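The representation swap described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not RCG's actual code: `extract_representation` is a hypothetical helper, the toy encoder stands in for a real frozen encoder (MoCo v3, CLIP's image tower, etc.), and the L2 normalization is one plausible way to put different encoders' features on a common scale — the exact preprocessing in RCG may differ.

```python
import torch
import torch.nn as nn

# Hypothetical plug-in point: any frozen image encoder that maps a batch of
# images to a (B, D) feature tensor can supply the representation for the
# first two stages. For CLIP, one option (not shown running here, since it
# downloads weights) is the open_clip package:
#
#   import open_clip
#   clip_model, _, preprocess = open_clip.create_model_and_transforms(
#       "ViT-B-32", pretrained="laion2b_s34b_b79k")
#   encoder = clip_model.encode_image

def extract_representation(encoder, images: torch.Tensor) -> torch.Tensor:
    """Run a frozen encoder and L2-normalize the features (one common choice;
    RCG's own normalization may differ)."""
    with torch.no_grad():
        rep = encoder(images)
    return nn.functional.normalize(rep, dim=-1)

# Toy stand-in encoder so the sketch runs without downloading any weights.
toy_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
images = torch.randn(4, 3, 32, 32)
rep = extract_representation(toy_encoder, images)
print(rep.shape)  # torch.Size([4, 256]); each row has unit L2 norm
```

The representation diffusion model and the pixel generator only see the feature tensor, so as long as the new encoder's output dimension matches what those stages expect (or they are retrained for it), the encoder is interchangeable.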

@tanbuzheng
Author

Thanks for your reply!
I have limited computing resources, only 1-2 RTX 3090 GPUs. Is it feasible to train the diffusion model at 256x256 resolution?
If I just want to train MAGE on ImageNet-1k, how long will it take?

@LTH14
Owner

LTH14 commented May 31, 2024

The representation diffusion model can be trained on a few GPUs. However, MAGE and the image diffusion models (DiT, LDM, ADM) need much more -- you can refer to Table 11 in the appendix for the specific training times.
