Cannot restart training after training tenc 2 AND using fused_backward_pass #1369
Comments
Thank you for opening this! Unfortunately, I cannot reproduce this issue. I think it may be caused by a difference in PyTorch version. Which version are you using? I'm using 2.1.2.
Sorry, I should have provided reproduction steps for this. Here they are now. The bug reproduces even on a clean checkout of the dev branch. (I haven't used main recently.)
(I have to install xformers with a second pip command, rather than adding it to the parameters of the first 'pip install' line, due to version incompatibilities with torch. I still get ...)

You asked about my torch version. Here it is (from ...):
Then I run training:
This should hopefully reproduce the issue for you. Thank you for your attention so far. :)

If I re-run these steps, but instead of ... If I re-run the steps, keeping ...

This seems to be an important bug to fix for SDXL training, as I am seeing amazing results from training tenc 2 at a very low learning rate of 1e-10. That rate has not been possible with bf16 training, as it is below the precision that bf16 can represent, but with the fp32 training made possible by fused_backward_pass and tenc 2 being trained, I see impressive image quality changes. I just cannot restart training once I stop!

By the way, I saw a new warning now that I've reinstalled sd-scripts to put together these reproduction instructions:
I didn't have that warning before, so I don't think it's relevant to this bug report.
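As a side note on the precision point above: here is a minimal sketch of why a 1e-10 step can survive in fp32 weights but is rounded away in bf16 weights. The weight magnitude of 1e-4 is an arbitrary assumption for illustration, not a value taken from this report.

```python
import torch

# Hypothetical numbers: a weight of magnitude ~1e-4 and an update of 1e-10
# (roughly what a 1e-10 learning rate would produce for a unit gradient).
w_fp32 = torch.tensor(1e-4, dtype=torch.float32)
w_bf16 = torch.tensor(1e-4, dtype=torch.bfloat16)
update = 1e-10

print((w_fp32 + update) != w_fp32)  # tensor(True):  the step changes the fp32 weight
print((w_bf16 + update) != w_bf16)  # tensor(False): the step is lost at bf16 precision
```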
Update to my long reproduction steps above: if I remove ..., the issue does not reproduce. So to reproduce the issue, all three of these options need to be set: ...
Being able to use .ckpt instead of .safetensors to continue training is great news, as it provides a workaround for restarting training even before this bug gets fixed.

Edit: I just got the error message again, even with .ckpt being used. :-/ Not sure why it worked for that one test run, but it seems the bug does not require safetensors after all.
Thank you for the detailed steps! The dev branch recommends ... In addition, I don't think the format of the file (.ckpt or .safetensors) affects the issue. So the issue may depend on something special...
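One quick way to check that the two formats carry the same weights is to load both into plain state dicts and compare the keys. A minimal sketch; the file names are placeholders, and the 'state_dict' nesting for .ckpt files is a common convention rather than a guarantee:

```python
import torch
from safetensors.torch import load_file

# Hypothetical file names for the same checkpoint saved in both formats.
sd_from_safetensors = load_file("model.safetensors")    # flat dict of tensors
ckpt = torch.load("model.ckpt", map_location="cpu")     # pickled Python object
sd_from_ckpt = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

print("identical key sets:", set(sd_from_safetensors) == set(sd_from_ckpt))
```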
Okay, so I took my build and installed those versions:
which got me:
(I had to include torchvision on that installation line to get a version that worked with torch 2.1.2.)

I made sure to regenerate a fresh .ckpt file, and didn't pick up the one I'd already made with the later torch version I previously had. But the same error message still reproduces, even with a new .ckpt written out by sd-scripts and torch/xformers set to these older versions.
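For anyone comparing environments, a small verification snippet (it assumes torchvision and xformers are importable; it is not part of the original comment):

```python
import torch
import torchvision
import xformers

# Print the installed versions to confirm which combination is actually in use.
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("xformers:", xformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```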
I started trying to get more information about this. Since the error message suggested that I add ..., I did so and captured the resulting trace.

Tenc2 is mentioned in that trace:

...

Any thoughts?
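One common suggestion in autograd error messages is to enable anomaly detection; here is a minimal sketch of turning it on, under the assumption that this is the kind of suggestion being followed above:

```python
import torch

# Anomaly detection records a forward-pass traceback for each autograd op, so an
# error raised during backward also points at the op that produced the bad tensor.
torch.autograd.set_detect_anomaly(True)

x = torch.randn(4, requires_grad=True)
loss = (x * 2).sum()
loss.backward()  # with a real failure, the error now includes the forward traceback
print(x.grad)
```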
Okay, I haven't fixed it, but I've now found a workaround that allows training of tenc2 to continue, and it also indicates roughly where the trouble is likely coming from. The workaround is to edit the torch library file ..., after which you can successfully continue training from a .safetensors or .ckpt that was written out by sd-scripts during training.

The issue seems to be related to this warning, which is seen even when using the recommended torch version 2.1.2:

...

I'll let someone who actually knows what they're doing with this figure out the correct fix. And I'm still not sure why this issue doesn't trigger when starting training from ...
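For context on that warning: PyTorch 2.1 asks callers of activation checkpointing to choose between the reentrant and non-reentrant implementations explicitly, and the two behave differently when a checkpointed module is re-run during backward. A minimal, generic sketch of passing the flag at the torch level (an illustration, not the sd-scripts code path):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(8, 8)
x = torch.randn(2, 8, requires_grad=True)

# Passing use_reentrant explicitly silences the torch 2.1 deprecation warning;
# use_reentrant=False selects the non-reentrant implementation that PyTorch
# recommends going forward (the default is slated to change to False).
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)
```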
I encountered a similar issue: when enabling ..., training hangs for me.

Add use_reentrant to the gradient_checkpointing_enable() call: ...
@Jannchie, I didn't know use_reentrant could be passed into gradient_checkpointing_enable() like that. That's great news, as it lets my issue be fixed with an sd-scripts change, rather than by hacking the library function like I was doing.

As for your hang, is it specific to AdamW? I've only been using Adafactor. Is there some advantage to using AdamW, by the way? I haven't tried it.
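For reference, recent Hugging Face transformers releases forward checkpointing options through a kwargs dict. A minimal sketch with a CLIP text encoder; the model id is a stand-in, and sd-scripts constructs the SDXL text encoders through its own loading path:

```python
from transformers import CLIPTextModel

# Hypothetical model id, used here only to get a checkpointing-capable text encoder.
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# In transformers >= 4.35 (an assumption about the minimum version), the kwargs
# dict is forwarded to torch.utils.checkpoint.checkpoint, so use_reentrant can
# be set explicitly instead of relying on the deprecated default.
text_encoder.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
text_encoder.train()
```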
I’m a beginner and not quite sure about the specific effects, but I am attempting to replicate the settings from https://huggingface.co/cagliostrolab/animagine-xl-3.1. Regarding the freeze issue: by referring to this issue, I found that specifying safetensors as the saving format (instead of the default diffusers format) can resolve the problem.
I've run into the same problem. In my case, the issue only started to appear after I switched to ...
Interesting. I'm also on 24GB, and batch size 4 works great for me. Have you ...?
My bad -- it turned out that I based my modifications on the ...
If you finetune SDXL base with:
Then it will train fine. But if you stop training and restart by training from, e.g., the
<whatever>-step00001000.safetensors
file, you get this error message:

...

This doesn't happen if you only train te1 and the unet. It also only happens when you use --fused_backward_pass.
Full call stack:
(I also mentioned this bug back when the original pull request occurred: see #1259)