[Bug/Future] bitsandbytes higher than 0.35.0 breaks training on 8bit adamW (Windows) #523
Comments
Seeing this, I thought something broke, so it seems like this is the cause. Thanks!
Did you use fp16?
I think I had the fp16 setting on.
fp16 has some problems in bnb versions above 0.35.
Do you think it would work correctly with bf16? I can try in a few hours.
Yeah, you can try bf16.
The issue happened in both SD1.5 and SDXL (training loss going to NaN).
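For anyone who wants to try the bf16 suggestion in a minimal setting outside the training scripts, here is a rough sketch with plain PyTorch autocast. It assumes a CUDA GPU with bf16 support, and the model/data are placeholders, not the setup discussed in this thread:

```python
# Minimal sketch: run a forward/backward pass under bf16 autocast instead of
# fp16. bf16 keeps the fp32 exponent range, so it is less prone to
# overflow/NaN than fp16 and does not need a GradScaler.
import torch
import torch.nn as nn

model = nn.Linear(128, 1).cuda()
loss_fn = nn.MSELoss()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 128, device="cuda")
y = torch.randn(32, 1, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)

# fail fast instead of silently training on a NaN loss
if torch.isnan(loss):
    raise RuntimeError("loss went NaN")

loss.backward()
opt.step()
opt.zero_grad()
```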
I've known of an issue with bitsandbytes since 0.35.0; at the time no one knew what caused it, they just knew to keep bitsandbytes at 0.35.0. With the advent of SDXL I started training again and thought that issue might have been fixed, until I trained SD v1.5 and SDXL over the last two days and got bad results: apparent overtraining, blotchiness, NaNs, etc. So I went back and tried to track down what might have changed from bitsandbytes 0.35.0 onward, and it turns out it was probably a mistaken indentation of a section of the 8-bit dynamic map code in bitsandbytes. That also explains why for months people have only been seeing this issue when using the 8-bit optimizers like AdamW8bit with bitsandbytes > 0.35.0, and not when using plain AdamW, for instance. Here are some links related to the issue; in the second link ArrowM tracked the issue down to this commit of bitsandbytes:
And RossM noticed it in the actual commit here: bitsandbytes-foundation/bitsandbytes@2f2063b#r109622091
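To make the suspected mistake concrete, here is a simplified illustration of the bug class. This is not the actual bitsandbytes dynamic map code, just a sketch of how moving a block that should run once after a loop into the loop body changes the values that get generated:

```python
# Simplified illustration only -- NOT the real bitsandbytes source. It shows
# how a single wrong indentation level changes a generated value table.
def build_map_correct(levels=4, extra=3):
    data = []
    for i in range(levels):
        data += [(i + 1) * 10 + k for k in range(2)]
    # this block is meant to run exactly once, after the loop
    if extra > 0:
        data += [100 + k for k in range(extra)]
    return sorted(data)

def build_map_indented(levels=4, extra=3):
    data = []
    for i in range(levels):
        data += [(i + 1) * 10 + k for k in range(2)]
        # the same block accidentally indented into the loop: it now runs on
        # every iteration, so the table picks up extra, duplicated entries
        if extra > 0:
            data += [100 + k for k in range(extra)]
    return sorted(data)

print(len(build_map_correct()), len(build_map_indented()))  # 11 vs 20
```

In the real library the affected values feed the 8-bit optimizer's dynamic quantization map, which would fit the observation above that only the 8-bit optimizers (AdamW8bit, etc.) misbehave while plain AdamW is fine.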
Thank you for reporting. So TimDettmers/bitsandbytes hasn't fixed it yet? bitsandbytes-foundation/bitsandbytes#262
It has not been fixed, no; it's been in the code since 0.35.0, and like I mentioned, we knew back at the end of 2022 that 'something' had happened to bitsandbytes after 0.35.0, we just didn't know what. It seems like that block shouldn't have been indented and is in fact a bug, as RossM pointed out:
It was probably an accident, and Tim Dettmers didn't close that PR or comment on it, so I'm not sure why ArrowM decided to close it without comment either. There are currently 28 open PRs on bitsandbytes; 7 were merged two weeks ago, but the last time any were merged was May 7th, so it makes sense that Tim is very busy and hasn't had time to read through all the issues (318) and PRs (28) to even see that someone pointed out the indentation issue present from 0.35.0 onward.
Thanks for the info! Sadly, I tried to fix the indent but I keep getting my loss going to 1.0/NaN, so I'm not sure if there's something more to it besides this. @sdbds I'm using bitsandbytes 0.35.0, and as far as I know, at least for SD 1.5 training, if you set unet_lr, the learning rate doesn't get used. These values give me good results for LoRA training at 1024x1024 on models based on SD1.5. Are these values too high for bitsandbytes 0.36.0+?
I've tested BNB 0.38.1 vs 0.41.1 using AdamW8bit, and it seems repaired in 0.41.1. Here are the results of fine-tuning Stable Diffusion 1.5 (unfreezing both the unet and the text encoder) on some characters from Final Fantasy. The top row of labels is the prompt used for inference for that column. The top row of outputs is SD1.5. The middle two rows are fine-tuned on a few thousand images using standard supervised fine-tuning (detailed human-written labels) with either BNB 0.38.1 or BNB 0.41.1, as marked. The bottom row is a reference image for each character chosen from the training data, since I'm sure plenty of people looking at this are not familiar with these specific subjects. I've trained this data, or variations of it, with the same software probably dozens or hundreds of times using known-working settings for learning rate, batch size, epochs, etc., as it is my personal "reference" set for debugging training issues. My conclusions: Most obviously, BNB 0.38.1 turned "morgan freeman" into a caucasian female with a sort of video-game-render aesthetic (an aesthetic which is present in all the training data). Emma Watson is also forgotten and turned into the same "video game render" aesthetic. No data for Morgan Freeman or Emma Watson is present in the fine-tuning data, and I would typically not expect such a significant loss of prior knowledge. More generally, with BNB 0.41.1 the quality is higher and there is either no prior loss or significantly less. BNB 0.41.1 AdamW8bit behaves as I would normally expect, and like the reference Torch AdamW.
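If anyone wants to reproduce this kind of AdamW vs AdamW8bit comparison without a full fine-tuning run, here is a rough sketch. It assumes a CUDA GPU and the bnb.optim.AdamW8bit API; the model and data are arbitrary placeholders, not the fine-tuning setup described above, and the layer is sized so the 8-bit state path is actually exercised (very small tensors may fall back to 32-bit states):

```python
# Rough sanity check: train the same tiny model with torch.optim.AdamW and
# bnb.optim.AdamW8bit from identical initial weights and compare loss curves.
import copy
import torch
import torch.nn as nn
import bitsandbytes as bnb

torch.manual_seed(0)
base = nn.Sequential(nn.Linear(4096, 64), nn.ReLU(), nn.Linear(64, 1)).cuda()
x = torch.randn(256, 4096, device="cuda")
y = torch.randn(256, 1, device="cuda")

def run(optimizer_cls):
    model = copy.deepcopy(base)          # same starting point for both runs
    opt = optimizer_cls(model.parameters(), lr=1e-3)
    losses = []
    for _ in range(200):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

ref = run(torch.optim.AdamW)
bnb8 = run(bnb.optim.AdamW8bit)
print("final loss  AdamW: %.4f   AdamW8bit: %.4f" % (ref[-1], bnb8[-1]))
# On a broken bitsandbytes build the 8-bit curve diverges or goes NaN;
# on a fixed build the two curves stay close.
```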
Finally managed to get some time to train, and I can confirm that the latest BNB fixes it. Thanks to all, and @victorchall!
Hi there! When updating bitsandbytes to any version higher than 0.35.0, all training runs get a loss value of NaN.
I've tested with cu116, cu117, cu118 and cu121 binaries (with torch+cu116, +cu117, +cu118 and +cu121 respectively) and the issue happens on all of them.
I know this is more of a future issue, but if the binaries ever have to be updated, they may suffer from this problem.
The bitsandbytes whls were obtained from https://github.com/jllllll/bitsandbytes-windows-webui
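For reference, here is a small guard one could drop into a training script to flag the affected range before starting a long run. It's only a sketch: it assumes the `packaging` package is available, and the "fixed in 0.41.1" upper bound is taken from the reports later in this thread, so treat the exact boundary as an assumption:

```python
# Warn before training if the installed bitsandbytes falls in the range this
# thread reports as broken for 8-bit optimizers (> 0.35.0 and, per the reports
# above, before the 0.41.x releases that appear to fix it).
import warnings
from importlib.metadata import version
from packaging.version import Version

bnb_version = Version(version("bitsandbytes"))
if Version("0.35.0") < bnb_version < Version("0.41.1"):
    warnings.warn(
        f"bitsandbytes {bnb_version} may produce NaN loss with 8-bit "
        "optimizers (e.g. AdamW8bit); pin 0.35.0 or upgrade to a fixed release."
    )
```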