
[SD3] Token limit in training #8557

Closed
Luciennnnnnn opened this issue Jun 14, 2024 · 7 comments · Fixed by #8564
Labels
bug Something isn't working

Comments

@Luciennnnnnn
Contributor

Describe the bug

I'm aware there is PR #8506 for the token limit in inference; however, the token limit in training is not handled. I suspect the token limit in training will affect performance.

Reproduction

no

Logs

No response

System Info

no

Who can help?

@yiyixuxu @asomoza

@Luciennnnnnn Luciennnnnnn added the bug Something isn't working label Jun 14, 2024
@Luciennnnnnn
Contributor Author

After rereading the SD3 paper, I noticed that they did indeed train with a 77-token limit, so the training script is correct. However, it is strange that SD3 still sticks to 77 tokens while T5 supports up to 512 tokens.

@AmericanPresidentJimmyCarter
Contributor

In the paper they trained on 77 tokens, but they reported that they moved to fine-tuning on the full T5 context (512 tokens) after the release of the paper.

@asomoza
Member

asomoza commented Jun 14, 2024

Hi, I'm taking note of everyone's concerns about the token limit. However, as always, we need to keep the code as clean and simple as possible; until we, or the community, can verify that using a higher token limit brings an improvement in training, we have to wait.

This is one of the main reasons we didn't add long prompt weighting as core functionality for previous models.

It would help us a lot if the community could also test this and post their findings here or in a discussion.

Initially, to test this, you only need to change the max_length number in the training scripts:
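For reference, the tokenization step looks roughly like this (a sketch, not the exact code in the scripts; the variable names are illustrative):

```python
from transformers import T5TokenizerFast

# Sketch of the T5 tokenization step in the SD3 training scripts
# (illustrative names; the real scripts may differ slightly).
tokenizer_3 = T5TokenizerFast.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", subfolder="tokenizer_3"
)
text_inputs = tokenizer_3(
    "a detailed training caption ...",
    padding="max_length",
    max_length=77,  # raise this (T5 supports up to 512 here) to test longer prompts
    truncation=True,
    return_tensors="pt",
)
```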

However, I don't see it as being that simple, because we still have the 77-token limit for the CLIP models, so the prompt could get truncated in the wrong place for those and make the training worse. As I see it, to add this we should also add the option to train with a different prompt for each text_encoder. However, this requires more complexity in the code and a lot of testing to prove that it is worth it.
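As a purely hypothetical sketch (none of these helper names exist in diffusers; they are only illustrative), per-encoder prompts would look something like this:

```python
# Hypothetical helper (not an existing diffusers API): give the CLIP
# encoders a short prompt that fits in 77 tokens and give T5 the long
# prompt, so neither gets truncated mid-sentence.
def encode_prompts(clip_prompt, t5_prompt,
                   clip_tokenizer, clip_encoder, t5_tokenizer, t5_encoder):
    clip_ids = clip_tokenizer(
        clip_prompt, padding="max_length", max_length=77,
        truncation=True, return_tensors="pt",
    ).input_ids
    t5_ids = t5_tokenizer(
        t5_prompt, padding="max_length", max_length=512,
        truncation=True, return_tensors="pt",
    ).input_ids
    # penultimate CLIP hidden states, as commonly used for conditioning
    clip_embeds = clip_encoder(clip_ids, output_hidden_states=True).hidden_states[-2]
    t5_embeds = t5_encoder(t5_ids)[0]
    return clip_embeds, t5_embeds
```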

Finally, we need to wait and see if training the T5 even matters; without any of the techniques for lowering VRAM consumption, the only ones capable of running it are 3090/4090 owners, who are a minority.

Hope this helps with your concerns.

@AmericanPresidentJimmyCarter
Contributor

It should at least be a command-line argument for the training script.
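Something like this (the flag name is hypothetical, just to illustrate):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag so the T5 token limit isn't hard-coded to 77.
parser.add_argument(
    "--max_sequence_length", type=int, default=77,
    help="Maximum number of T5 tokens used when encoding the prompt.",
)
args = parser.parse_args()
```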

@Luciennnnnnn
Contributor Author

> In the paper they trained on 77 tokens, but they reported that they moved to fine-tuning on the full T5 context (512 tokens) after the release of the paper.

Hi, I'm curious where they state that they fine-tuned on 512 tokens; I wasn't aware of that. Any details on it?

@Luciennnnnnn
Contributor Author

> Hi, I'm taking note of everyone's concerns about the token limit. However, as always, we need to keep the code as clean and simple as possible; until we, or the community, can verify that using a higher token limit brings an improvement in training, we have to wait.
>
> This is one of the main reasons we didn't add long prompt weighting as core functionality for previous models.
>
> It would help us a lot if the community could also test this and post their findings here or in a discussion.
>
> Initially, to test this, you only need to change the max_length number in the training scripts.
>
> However, I don't see it as being that simple, because we still have the 77-token limit for the CLIP models, so the prompt could get truncated in the wrong place for those and make the training worse. As I see it, to add this we should also add the option to train with a different prompt for each text_encoder. However, this requires more complexity in the code and a lot of testing to prove that it is worth it.
>
> Finally, we need to wait and see if training the T5 even matters; without any of the techniques for lowering VRAM consumption, the only ones capable of running it are 3090/4090 owners, who are a minority.
>
> Hope this helps with your concerns.

Thank you for the detailed reply; I understand your concerns. FYI, PixArt-Sigma claimed benefits from longer token sequences in training. That said, it only uses T5 as its text encoder, which is less complicated than SD3, which has multiple text encoders with different token limits.

@AmericanPresidentJimmyCarter
Contributor

> > In the paper they trained on 77 tokens, but they reported that they moved to fine-tuning on the full T5 context (512 tokens) after the release of the paper.
>
> Hi, I'm curious where they state that they fine-tuned on 512 tokens; I wasn't aware of that. Any details on it?

There isn't any official correspondence; I just heard from someone at SAI that the released model was trained on the long T5 token context. There are other peculiarities with it; for example, the QK RMSNorm appears to be missing despite the paper showing that it stabilises the model.
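For context, QK RMSNorm just means RMS-normalizing the queries and keys before the attention dot product. A rough sketch (illustrative only, and omitting the learnable scale the paper uses):

```python
import math
import torch
import torch.nn.functional as F

# Illustrative QK RMSNorm (learnable scale omitted): normalize queries
# and keys over the head dimension to keep attention logits bounded.
def qk_norm_attention(q, k, v, eps=1e-6):
    q = q * torch.rsqrt(q.pow(2).mean(dim=-1, keepdim=True) + eps)
    k = k * torch.rsqrt(k.pow(2).mean(dim=-1, keepdim=True) + eps)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return F.softmax(scores, dim=-1) @ v
```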
