
[SD3] Token limit in training #8557

Closed
Luciennnnnnn opened this issue Jun 14, 2024 · 7 comments · Fixed by #8564
Labels
bug Something isn't working

Comments

@Luciennnnnnn
Contributor

Describe the bug

I'm aware there is PR #8506 for the token limit in inference; however, the token limit in training is not handled. I suspect the token limit in training will affect performance.

Reproduction

no

Logs

No response

System Info

no

Who can help?

@yiyixuxu @asomoza

@Luciennnnnnn Luciennnnnnn added the bug Something isn't working label Jun 14, 2024
@Luciennnnnnn
Contributor Author

After rereading the SD3 paper, I noticed that they did indeed train with a 77-token limit, so the training script is correct. However, it is strange that SD3 still sticks to 77 tokens while T5 supports up to 512 tokens.

@AmericanPresidentJimmyCarter
Contributor

In the paper they trained on 77 tokens, but they reported that they moved to fine-tuning on the full T5 context (512 tokens) after the release of the paper.

@asomoza
Member

asomoza commented Jun 14, 2024

Hi, I'm taking note of everyone's concerns about the token limit. However, as always, we need to keep the code as clean and simple as possible; until we, or the community, can verify that using a higher token limit brings an improvement in training, we have to wait.

This is one of the main reasons we didn't add long prompt weighting as core functionality for previous models.

It would help us a lot if the community could also test this and post their findings here or in a discussion.

Initially, to test this, you only need to change the max_length number in the training scripts:
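For reference, the tokenization step looks roughly like this (a sketch, not the exact code in the scripts; the variable names are illustrative):

```python
from transformers import T5TokenizerFast

# Sketch of the T5 tokenization step in the SD3 training scripts
# (illustrative names; the real scripts may differ slightly).
tokenizer_3 = T5TokenizerFast.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", subfolder="tokenizer_3"
)
text_inputs = tokenizer_3(
    "a detailed training caption ...",
    padding="max_length",
    max_length=77,  # raise this (T5 supports up to 512 here) to test longer prompts
    truncation=True,
    return_tensors="pt",
)
```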

However, I don't see it as being that simple, because we still have the 77-token limit for the CLIP models, so the prompt could get truncated in the wrong place for those and make the training worse. As I see it, to add this we should also add the option to train with a different prompt for each text_encoder. However, this requires more complexity in the code and a lot of testing to prove that it is worth it.
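As a purely hypothetical sketch (none of these helper names exist in diffusers; they are only illustrative), per-encoder prompts would look something like this:

```python
# Hypothetical helper (not an existing diffusers API): give the CLIP
# encoders a short prompt that fits in 77 tokens and give T5 the long
# prompt, so neither gets truncated mid-sentence.
def encode_prompts(clip_prompt, t5_prompt,
                   clip_tokenizer, clip_encoder, t5_tokenizer, t5_encoder):
    clip_ids = clip_tokenizer(
        clip_prompt, padding="max_length", max_length=77,
        truncation=True, return_tensors="pt",
    ).input_ids
    t5_ids = t5_tokenizer(
        t5_prompt, padding="max_length", max_length=512,
        truncation=True, return_tensors="pt",
    ).input_ids
    # penultimate CLIP hidden states, as commonly used for conditioning
    clip_embeds = clip_encoder(clip_ids, output_hidden_states=True).hidden_states[-2]
    t5_embeds = t5_encoder(t5_ids)[0]
    return clip_embeds, t5_embeds
```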

Finally, we need to wait and see if training the T5 even matters; without any of the techniques for lowering VRAM consumption, the only ones capable of running it are 3090/4090 owners, who are a minority.

Hope this helps with your concerns.

@AmericanPresidentJimmyCarter
Contributor

It should at least be a command-line argument for the training script.
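Something like this (the flag name is hypothetical, just to illustrate):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag so the T5 token limit isn't hard-coded to 77.
parser.add_argument(
    "--max_sequence_length", type=int, default=77,
    help="Maximum number of T5 tokens used when encoding the prompt.",
)
args = parser.parse_args()
```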

@Luciennnnnnn
Contributor Author

> In the paper they trained on 77 tokens, but they reported that they moved to fine-tuning on the full T5 context (512 tokens) after the release of the paper.

Hi, I'm curious where they state that they fine-tuned on 512 tokens; I wasn't aware of that. Any details on it?

@Luciennnnnnn
Contributor Author

> Hi, I'm taking note of everyone's concerns about the token limit. However, as always, we need to keep the code as clean and simple as possible; until we, or the community, can verify that using a higher token limit brings an improvement in training, we have to wait.
>
> This is one of the main reasons we didn't add long prompt weighting as core functionality for previous models.
>
> It would help us a lot if the community could also test this and post their findings here or in a discussion.
>
> Initially, to test this, you only need to change the max_length number in the training scripts.
>
> However, I don't see it as being that simple, because we still have the 77-token limit for the CLIP models, so the prompt could get truncated in the wrong place for those and make the training worse. As I see it, to add this we should also add the option to train with a different prompt for each text_encoder. However, this requires more complexity in the code and a lot of testing to prove that it is worth it.
>
> Finally, we need to wait and see if training the T5 even matters; without any of the techniques for lowering VRAM consumption, the only ones capable of running it are 3090/4090 owners, who are a minority.
>
> Hope this helps with your concerns.

Thank you for the detailed reply; I understand your concerns. FYI, PixArt-Sigma claimed benefits from longer token sequences in training. That said, it only uses T5 as its text encoder, which is less complicated than SD3, which has multiple text encoders with different token limits.

@AmericanPresidentJimmyCarter
Contributor

> > In the paper they trained on 77 tokens, but they reported that they moved to fine-tuning on the full T5 context (512 tokens) after the release of the paper.
>
> Hi, I'm curious where they state that they fine-tuned on 512 tokens; I wasn't aware of that. Any details on it?

There isn't any official correspondence; I just heard from someone at SAI that the released model was trained on the long T5 token context. There are other peculiarities with it; for example, the QK RMSNorm appears to be missing despite the paper showing that it stabilises the model.
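For context, QK RMSNorm just means RMS-normalizing the queries and keys before the attention dot product. A rough sketch (illustrative only, and omitting the learnable scale the paper uses):

```python
import math
import torch
import torch.nn.functional as F

# Illustrative QK RMSNorm (learnable scale omitted): normalize queries
# and keys over the head dimension to keep attention logits bounded.
def qk_norm_attention(q, k, v, eps=1e-6):
    q = q * torch.rsqrt(q.pow(2).mean(dim=-1, keepdim=True) + eps)
    k = k * torch.rsqrt(k.pow(2).mean(dim=-1, keepdim=True) + eps)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return F.softmax(scores, dim=-1) @ v
```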
