[Bug/Future] bitsandbytes higher than 0.35.0 breaks training on 8bit adamW (Windows) #523
Comments
Seeing this, I thought something broke, so it seems like this is the cause. Thanks!
Did you use fp16?
I think I had the fp16 setting on.
fp16 has some problems in bnb versions above 0.35.
Do you think it would work correctly with bf16? I can try in a few hours.
Yeah, you can try bf16.
The issue happened in both SD1.5 and SDXL (training loss going to NaN).
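For anyone who wants to try the bf16 suggestion in a minimal setting outside the training scripts, here is a rough sketch with plain PyTorch autocast. It assumes a CUDA GPU with bf16 support, and the model/data are placeholders, not the setup discussed in this thread:

```python
# Minimal sketch: run a forward/backward pass under bf16 autocast instead of
# fp16. bf16 keeps the fp32 exponent range, so it is less prone to
# overflow/NaN than fp16 and does not need a GradScaler.
import torch
import torch.nn as nn

model = nn.Linear(128, 1).cuda()
loss_fn = nn.MSELoss()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 128, device="cuda")
y = torch.randn(32, 1, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)

# fail fast instead of silently training on a NaN loss
if torch.isnan(loss):
    raise RuntimeError("loss went NaN")

loss.backward()
opt.step()
opt.zero_grad()
```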
I've known of an issue with bitsandbytes since 0.35.0; at the time no one knew what caused it, they just knew to keep bitsandbytes at 0.35.0. With the advent of SDXL I started training again and thought that issue might have been fixed, until I trained SD v1.5 and SDXL over the last two days and got bad results: apparent overtraining, blotchiness, NaNs, etc. So I went back and tried to track down what might have changed from bitsandbytes 0.35.0 onward, and it turns out it was probably a mistaken indentation of a section of the 8-bit dynamic map code in bitsandbytes. That also explains why for months people have only been seeing this issue when using the 8-bit optimizers like AdamW8bit with bitsandbytes > 0.35.0, and not when using plain AdamW, for instance. Here are some links related to the issue; in the second link ArrowM tracked the issue down to this commit of bitsandbytes:
And RossM noticed it in the actual commit here: bitsandbytes-foundation/bitsandbytes@2f2063b#r109622091
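To make the suspected mistake concrete, here is a simplified illustration of the bug class. This is not the actual bitsandbytes dynamic map code, just a sketch of how moving a block that should run once after a loop into the loop body changes the values that get generated:

```python
# Simplified illustration only -- NOT the real bitsandbytes source. It shows
# how a single wrong indentation level changes a generated value table.
def build_map_correct(levels=4, extra=3):
    data = []
    for i in range(levels):
        data += [(i + 1) * 10 + k for k in range(2)]
    # this block is meant to run exactly once, after the loop
    if extra > 0:
        data += [100 + k for k in range(extra)]
    return sorted(data)

def build_map_indented(levels=4, extra=3):
    data = []
    for i in range(levels):
        data += [(i + 1) * 10 + k for k in range(2)]
        # the same block accidentally indented into the loop: it now runs on
        # every iteration, so the table picks up extra, duplicated entries
        if extra > 0:
            data += [100 + k for k in range(extra)]
    return sorted(data)

print(len(build_map_correct()), len(build_map_indented()))  # 11 vs 20
```

In the real library the affected values feed the 8-bit optimizer's dynamic quantization map, which would fit the observation above that only the 8-bit optimizers (AdamW8bit, etc.) misbehave while plain AdamW is fine.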
Thank you for reporting. So TimDettmers/bitsandbytes hasn't fixed it yet? bitsandbytes-foundation/bitsandbytes#262
It has not been fixed, no; it's been in the code since 0.35.0, and like I mentioned, we knew back at the end of 2022 that 'something' had happened to bitsandbytes after 0.35.0, we just didn't know what. It seems like that block shouldn't have been indented and is in fact a bug, as RossM pointed out:
It was probably an accident, and Tim Dettmers didn't close that PR or comment on it, so I'm not sure why ArrowM decided to close it without comment either. There are currently 28 open PRs on bitsandbytes; 7 were merged two weeks ago, but the last time any were merged was May 7th, so it makes sense that Tim is very busy and hasn't had time to read through all the issues (318) and PRs (28) to even see that someone pointed out the indentation issue present from 0.35.0 onward.
Thanks for the info! Sadly, I tried to fix the indent but I keep getting my loss going to 1.0/NaN, so I'm not sure if there's something more to it besides this. @sdbds I'm using bitsandbytes 0.35.0, and as far as I know, at least for SD 1.5 training, if you set unet_lr, the learning rate doesn't get used. These values give me good results for LoRA training at 1024x1024 on models based on SD1.5. Are these values too high for bitsandbytes 0.36.0+?
I've tested BNB 0.38.1 vs 0.41.1 using AdamW8bit, and it seems repaired in 0.41.1. Here are the results of fine-tuning Stable Diffusion 1.5 (unfreezing both the unet and the text encoder) on some characters from Final Fantasy. The top row of labels is the prompt used for inference for that column. The top row of outputs is SD1.5. The middle two rows are fine-tuned on a few thousand images using standard supervised fine-tuning (detailed human-written labels) with either BNB 0.38.1 or BNB 0.41.1, as marked. The bottom row is a reference image for each character chosen from the training data, since I'm sure plenty of people looking at this are not familiar with these specific subjects. I've trained this data, or variations of it, with the same software probably dozens or hundreds of times using known-working settings for learning rate, batch size, epochs, etc., as it is my personal "reference" set for debugging training issues. My conclusions: Most obviously, BNB 0.38.1 turned "morgan freeman" into a caucasian female with a sort of video-game-render aesthetic (an aesthetic which is present in all the training data). Emma Watson is also forgotten and turned into the same "video game render" aesthetic. No data for Morgan Freeman or Emma Watson is present in the fine-tuning data, and I would typically not expect such a significant loss of prior knowledge. More generally, with BNB 0.41.1 the quality is higher and there is either no prior loss or significantly less. BNB 0.41.1 AdamW8bit behaves as I would normally expect, and like the reference Torch AdamW.
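If anyone wants to reproduce this kind of AdamW vs AdamW8bit comparison without a full fine-tuning run, here is a rough sketch. It assumes a CUDA GPU and the bnb.optim.AdamW8bit API; the model and data are arbitrary placeholders, not the fine-tuning setup described above, and the layer is sized so the 8-bit state path is actually exercised (very small tensors may fall back to 32-bit states):

```python
# Rough sanity check: train the same tiny model with torch.optim.AdamW and
# bnb.optim.AdamW8bit from identical initial weights and compare loss curves.
import copy
import torch
import torch.nn as nn
import bitsandbytes as bnb

torch.manual_seed(0)
base = nn.Sequential(nn.Linear(4096, 64), nn.ReLU(), nn.Linear(64, 1)).cuda()
x = torch.randn(256, 4096, device="cuda")
y = torch.randn(256, 1, device="cuda")

def run(optimizer_cls):
    model = copy.deepcopy(base)          # same starting point for both runs
    opt = optimizer_cls(model.parameters(), lr=1e-3)
    losses = []
    for _ in range(200):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

ref = run(torch.optim.AdamW)
bnb8 = run(bnb.optim.AdamW8bit)
print("final loss  AdamW: %.4f   AdamW8bit: %.4f" % (ref[-1], bnb8[-1]))
# On a broken bitsandbytes build the 8-bit curve diverges or goes NaN;
# on a fixed build the two curves stay close.
```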
Finally managed to get some time to train, and I can confirm that the latest BNB fixes it. Thanks to all, and @victorchall!
Hi there! When updating bitsandbytes to any version higher than 0.35.0, all training runs get a loss value of NaN.
I've tested with cu116, cu117, cu118 and cu121 binaries (with torch+cu116, +cu117, +cu118 and +cu121 respectively) and the issue happens on all of them.
I know this is more of a future issue, but if the binaries ever have to be updated, they may suffer from this problem.
The bitsandbytes whls were obtained from https://github.com/jllllll/bitsandbytes-windows-webui
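For reference, here is a small guard one could drop into a training script to flag the affected range before starting a long run. It's only a sketch: it assumes the `packaging` package is available, and the "fixed in 0.41.1" upper bound is taken from the reports later in this thread, so treat the exact boundary as an assumption:

```python
# Warn before training if the installed bitsandbytes falls in the range this
# thread reports as broken for 8-bit optimizers (> 0.35.0 and, per the reports
# above, before the 0.41.x releases that appear to fix it).
import warnings
from importlib.metadata import version
from packaging.version import Version

bnb_version = Version(version("bitsandbytes"))
if Version("0.35.0") < bnb_version < Version("0.41.1"):
    warnings.warn(
        f"bitsandbytes {bnb_version} may produce NaN loss with 8-bit "
        "optimizers (e.g. AdamW8bit); pin 0.35.0 or upgrade to a fixed release."
    )
```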