Add Minimal Implementation of Masked Weight Loss #236
Conversation
How is the mask used by the script? Is it separate files in the image folder, one for each image with a face? Does pure white mean full weight for learning, black mean don't look at it, and 50% grey mean look at it but not as much as the face area? Could you share the test dataset for testing and studying? This is an interesting concept. More work, but I think the results are well worth it. |
Here is the dataset I showed above. The masks are saved as the alpha channel of the PNGs, which, as long as you don't pre-multiply, keeps the RGB channels intact and prevents yet another instance of find-the-matching-file like exists with captions and .npz files. I made these masks in GIMP, but I've also experimented with depth maps on my private dataset. You are correct: fully white areas experience normal loss, and black areas are completely removed from the loss. The gray in these masks is 75% I believe; I will run again in the next day with 50% and 25% minimums. Feel free to check my math, and some things like the *8 in the reshape are straight from @cloneofsimo, and I don't know exactly how that operates. |
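For readers following along, here is a minimal sketch of reading such an alpha-channel mask; the file name and variable names are illustrative, not from the PR:

```python
# Minimal sketch: read an RGBA PNG, keep the color channels untouched,
# and turn the alpha channel into a 0-1 weight mask.
import numpy as np
from PIL import Image

img = np.asarray(Image.open("example.png").convert("RGBA"))  # "example.png" is hypothetical
rgb = img[..., :3]                                            # RGB stays intact (no pre-multiply)
mask = img[..., 3].astype(np.float32) / 255.0                 # alpha -> weights in [0, 1]
```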
Preliminary results of a crude test look impressive. Here is a graph showing the average loss of the non-masked training vs the masked training: masked training is the one in orange. The non-masked training is not yet completed... but the samples from the masked training are pretty good. I have never seen a loss this low... and the results are still ok... Usually a loss this low means a super fried model... but not this time. |
And here is a quick comparison: With masked face: Without masked face: Same seed, same prompt. This is shocking. The masked output is much more like the dataset... very interesting PR. The proportions of the non-masked model are all over the place. The masked one appears to have learned much better. I guess masking must help the trainer focus attention only on important aspects of the subject and not spend time learning blurry backgrounds, etc. I used Photoshop to create the mask and isolated the subject as part of the masking. Here is an example source and masked image: I applied a 50% grey mask over the body and hair, no mask over the face, and black over the background on each of the 16 source images. |
This is to be expected, as the tensors are being modified before the mean squared error step. The errors in the larger blacked-out backgrounds are not being counted at all. My back-of-the-envelope math says you can't account for the variation by a single factor; you'd have to run the MSE twice in order to get a number comparable to the unmasked training. This complicates tracking losses against unmasked runs, but similar runs of masked datasets have similar losses in my experimentation. |
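As described in the PR text below, the mask is multiplied into both tensors right before the MSE step; a minimal sketch of that step, with shapes and names assumed for illustration:

```python
import torch
import torch.nn.functional as F

def masked_loss(noise_pred, target, mask):
    # noise_pred, target: (batch, 4, 64, 64) latent-space tensors
    # mask: (batch, 1, 64, 64) weights in [0, 1], broadcast over channels.
    # Black (0) regions contribute nothing to the error, so the reported
    # loss is not directly comparable to an unmasked run.
    return F.mse_loss(noise_pred * mask, target * mask, reduction="mean")
```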
This is part of the reason I ran a Gaussian blur on my alpha layers (though perhaps a bloom might be more effective, to only push higher values out rather than meld evenly like a blur). My thought is that allowing some training at the interface between subject and background might help with those edge details. |
Double check that your alpha channel was not pre-multiplied with the color channels. The first three channels should be exactly the same as the original image. If that was at fault, it makes the case for side loading the masks, to prevent people from doing the same. Side loading would also allow the use of 16-bit grayscale depth masks natively, i.e. the outputs from https://github.com/thygate/stable-diffusion-webui-depthmap-script |
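A small sketch of loading such a 16-bit grayscale depth map as a weight mask, assuming OpenCV is available; the file name is illustrative:

```python
# Read a 16-bit PNG depth map without converting it to 8-bit,
# then scale it to [0, 1] for use as a loss weight mask.
import cv2
import numpy as np

depth = cv2.imread("example-depth.png", cv2.IMREAD_UNCHANGED)  # uint16 for 16-bit PNGs
mask = depth.astype(np.float32) / 65535.0                      # depth values -> [0, 1] weights
```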
I am using Photoshop... so it can be a bit difficult... perhaps the issue is that I tried to apply a Gaussian blur to the mask but it got applied to the whole image? Anyhow, I redid a mask of just the subject with no blur and it turned out pretty good. In fact, it appears to have captured body proportions better as well. |
The other test that could be run is whether or not EXTRA attention is beneficial. DreamArtist used a non-linear scale from 0-5 instead of the linear scale here of 0-1: https://github.com/7eu7d7/DreamArtist-sd-webui-extension#attention-mask To test a linear scale from 0-2, line 521 could be changed so the mask is scaled to that range, along the lines of the sketch below. |
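The thread does not show the original code at that line, so the following is only a hypothetical sketch of such a change, assuming the raw 8-bit alpha values are being converted to weights:

```python
# Hypothetical: map the 0-255 alpha mask to a 0-2 weight range instead of 0-1,
# so fully white regions receive double the normal loss weight.
mask = alpha_mask.float() / 255.0 * 2.0  # `alpha_mask` is an assumed variable name
```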
I just switched to sideloaded masks to avoid confusion with the alpha layers. This is not backward compatible with alpha layer masks. Also added a |
Thank you for this PR! This is quite interesting. I think masked loss is an advanced feature and it is out of scope for the repo, but the minimal implementation is good. I also think it might be an idea to use separate images for the masks, because the alpha channel is a bit difficult to manage. I am working on another big PR, so I will review in the near future. There might be some things to consider in how to implement it (such as dataset features like cropping etc.) |
@kohya-ss I enjoy the projects and sharing what I come up with for my own training, and I do like that you keep a very stable codebase, so I am by no means offended if I am outside of your intended scope. May I ask why you pass all of the |
Oops, didn't mean to close it, fat thumbs, but I will leave the fork as is, so it's available for review. |
Thank you! There is no major reason for not passing |
I don't envy you for having to do it, but I look forward to it being done at some point. :) |
@AI-Casanova I tried the new .mask feature and it worked very well. I was not able to get anywhere with the Photoshop alpha channel yesterday for a model, but using depth-map-generated images named .mask has produced great results. Too bad this feature is not getting integrated at the moment. Hopefully, once the large change is implemented, @kohya-ss might consider integrating it. |
In a way, using a depth map mask is like a ControlNet for learning... It guides SD to learn what is important, maximising the learning potential of each image. |
That was my hope, to reduce reliance on captioning backgrounds. It is my (perhaps mistaken) feeling that the longer the caption, the more likely it is for the identity I'm trying to train to bleed out into the rest of the caption. |
I think you are right. I wonder if ControlNet depth_leres could be used dynamically at dataset image load time to generate a depth mask for training... using ControlNet as a tool to focus training vs manually creating the masks beforehand. I guess the generated masks could be saved for reuse in other runs... |
Are you talking about this? https://github.com/aim-uofa/AdelaiDepth/tree/main/LeReS I just had a thought as well: we are reshaping the mask down to 64x64 at the point of noise. I followed cloneofsimo in using 'nearest', but wouldn't we want a max function there? I'd think we'd want to train the parts of the noise that include our important subject as strongly as we ask for the subject, even if that raises the surroundings some. |
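A sketch of the max-based downsampling idea, under the assumption that the mask starts at the image resolution (e.g. 512x512) and must land on the 64x64 latent grid; names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def downsample_mask_max(mask, latent_size=64):
    # mask: (batch, 1, 512, 512) weights in [0, 1]
    kernel = mask.shape[-1] // latent_size            # e.g. 512 // 64 = 8
    # Max pooling keeps the subject's full weight in any latent cell it touches,
    # unlike 'nearest', which can drop thin subject regions entirely.
    return F.max_pool2d(mask, kernel_size=kernel)     # -> (batch, 1, 64, 64)
```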
@AI-Casanova will this PR still be worked on? I don't know if this feature was added in another commit or not, or how to use it. |
@flesnuk this PR needs to be reworked from scratch because of the major changes that have happened in the meantime, but it is something that I should probably work on, as it showed some promise. |
Any movement on this? |
@TingTingin Honestly, it has quite fallen off my radar in favor of other projects. @kohya-ss Would you be interested if I took another crack at this? I'd likely put the mask loading as a TOML config option to select a folder and then search for matching names. It occurs to me that it might be quite possible to use mediapipe to auto-mask faces, with the face set at one value and everything else at another. This would be potentially useful both to train faces, and to ignore them when training a costume etc. |
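A rough sketch of what that mediapipe-based auto-masking could look like; the weight defaults and helper name are assumptions, not anything implemented in this PR:

```python
import cv2
import mediapipe as mp
import numpy as np

def make_face_mask(image_path, face_weight=1.0, other_weight=0.25):
    """Return a grayscale mask with detected face regions at face_weight
    and everything else at other_weight (both hypothetical defaults)."""
    image = cv2.imread(image_path)
    h, w = image.shape[:2]
    mask = np.full((h, w), int(other_weight * 255), dtype=np.uint8)
    with mp.solutions.face_detection.FaceDetection(model_selection=1) as detector:
        results = detector.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    for det in results.detections or []:
        box = det.location_data.relative_bounding_box   # relative [0, 1] coordinates
        x0, y0 = max(int(box.xmin * w), 0), max(int(box.ymin * h), 0)
        x1, y1 = int((box.xmin + box.width) * w), int((box.ymin + box.height) * h)
        mask[y0:y1, x0:x1] = int(face_weight * 255)
    return mask
```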
Wouldn't including the mask in the same folder be the most sensible, since that's how the other settings have worked thus far? It would be interesting, though, if the ability to use some auto-generated mask was added; people could potentially extend it with different segmenters later. |
There's a potential file naming issue with using the same folder. You could rename the files to *.mask or something, but IMO people might be more comfortable to have /dataset/ABCD.png and /dataset/mask/ABCD.png, and I believe it would make the code cleaner as well. |
Could potentially pass a mask folder argument with the default being the dataset/mask folder |
Good idea. For every folder, if masked loss is enabled, assume folder/mask/* unless specified otherwise. |
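A small hypothetical helper showing that folder convention (none of this is code from the repository):

```python
import os

def find_mask_path(image_path, mask_dir=None):
    # Default to <dataset>/mask/<name>.png unless a mask folder is given explicitly.
    name, _ = os.path.splitext(os.path.basename(image_path))
    folder = mask_dir or os.path.join(os.path.dirname(image_path), "mask")
    candidate = os.path.join(folder, name + ".png")
    return candidate if os.path.exists(candidate) else None
```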
@AI-Casanova In addition, I think that face detection and automatic mask creation is a complex task and requires different dependencies, so it might be better as a separate repository. |
I noticed there was a ControlNet branch but never popped in; that sounds perfect. I understand the point about auto-masking, I'll take some time to think about what that would best look like. As to the extra input and data loader changes, it's somewhat tangential to another proposal I had for you. With the inclusion of sliced VAE, and thus the ability to load larger images, it might now be possible to add data augmentations like random crop, random zoom, affine transforms etc. to cached latents by loading them larger and slicing in the data loader. This would have to be replicated for potential masks, but that isn't an insurmountable problem. |
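A sketch of that augmentation idea, assuming latents are cached slightly larger than the training size and the mask is kept at 1/8 of the image resolution so the same crop applies to both; all names are illustrative:

```python
import torch

def random_crop_latent_and_mask(latent, mask, crop=64):
    # latent: (4, H, W) cached larger than `crop`; mask: (1, H, W) at the same scale
    _, h, w = latent.shape
    top = torch.randint(0, h - crop + 1, (1,)).item()
    left = torch.randint(0, w - crop + 1, (1,)).item()
    return (latent[:, top:top + crop, left:left + crop],
            mask[:, top:top + crop, left:left + crop])
```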
Certainly, it would be attractive if augmentation could be performed on cached latents. However, there seems to be a subtle difference between latents retrieved from a cropped image and cropped latents of the entire image. This is a rather annoying problem (which is why the VAE's simple tiling produces checkerboard patterns). |
@AI-Casanova I've submitted a slightly reworked version of your PR; it should work with the current main branch. I've changed the mask loading logic to look for masks in a separate folder. I've also tweaked the MSE loss calculation: mask values are now normalized using the mask mean value. I believe this makes the magnitude of the calculated loss less sensitive to the amount of non-black pixels in the mask - my testing agrees with this. |
Just for the sake of clarification, the mask logic as is makes it so that white pixels have the most attention, grey the second most, and black the lowest? |
@recris that's amazing! I'll pull as soon as life lets me and give it a try. |
The relative attention of each pixel is still the same; grayscale values are mapped to the [0, 1] weight range. As an example, let's say you have a training set with 2 black-and-white masks, one with a very large white area (A) and the other with a small white area (B). Without re-scaling, the computed MSE loss will be proportional to the amount of white pixels in the overall image, meaning the calculated loss in A will be (statistically) greater than the loss in B, and this may skew the training. By re-scaling we eliminate this proportionality effect. |
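One plausible reading of the mean-based normalization described here, as a sketch rather than the exact code from the reworked PR:

```python
import torch

def normalized_masked_loss(noise_pred, target, mask, eps=1e-6):
    # mask: (batch, 1, h, w) weights in [0, 1], broadcast over the latent channels
    per_pixel = (noise_pred - target) ** 2 * mask
    # Dividing by the mask mean keeps the loss magnitude roughly independent of
    # how much of each image is masked out (the A-vs-B example above).
    return per_pixel.mean() / (mask.mean() + eps)
```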
I worry a bit about that rescaling. In my use case, I'd mask faces at 1, bodies at 0.5, and backgrounds at 0.25 perhaps. I would think that any face should be backpropagated with respect to the same per-pixel loss, disregarding the reported loss, as I don't find loss graphs to be valuable. |
I've done a few runs with re-scaling and haven't found negative effects yet. Keep in mind that the relative pixel weights stay the same, since everything is scaled by the same factor. The biggest improvement I found with this whole change is that certain details from the training data, like backgrounds, stopped leaking into the generated images. This was especially evident when the model was over-fitted. Previously I had to be smarter when choosing and labeling the images, but now I can get the same quality with less effort put into the training set. |
Out of curiosity, are you fully zeroing your backgrounds, or leaving a bit for context? |
My backgrounds are fully black, but I am still providing background descriptions in the captions (not sure if it matters) |
Because creating human subject masks is very tedious, I've come up with a small script to automate this process: https://github.com/recris/subject-masker It uses a combination of face detection, parsing and recognition models, plus an instance segmentation model, to generate subject masks. It supports providing distinct weight values for face, hair, body and background. Overall the generated masks are pretty good, with the occasional mask that needs some additional cleaning in GIMP. |
All the things I was going to do """someday""" this boss already did. |
This is my first ever Pull Request, so bear with me, please.
@cloneofsimo instituted weighted loss here: cloneofsimo/lora#96 based on a facial recognition mask.
Following his work somewhat (I am no programmer) I came up with this implementation. It takes the alpha channel of a PNG and converts it to a weight mask between 0-1. This is multiplied into `noise_pred` and `target` immediately before the loss is calculated.
I was unsure how you'd like to handle passing `args.masked_loss` to `train_util.py`, so my PR loads masks from every image and stores them, regardless of whether the files have an alpha layer.
The new argument `--masked_loss` toggles the multiplication of the mask into `noise_pred` and `target`.
I have only tested this on `train_network.py` dreambooth, as I do not have any other datasets prepared, but I do not believe that the method I implemented will impact the other methods.