Add pipeline_stable_diffusion_3_inpaint.py for SD3 Inference #8709
Conversation
Our teammate has implemented an inpaint pipeline for SD3. Could you review this PR? @yiyixuxu @sayakpaul Thanks!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
thanks for the PR!
my main feedback is: instead of following the deprecated inpaint_legacy pipeline, can we match the SD/SDXL inpaint pipeline?
I think it is ok to only implement the use case where a regular text-to-image transformer checkpoint is used (this means we should match the algorithm in the SD inpaint pipeline when the unet only has 4 channels, not 9); we can refactor later when we have an SD3 inpaint checkpoint.
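For reference, the 4-channel ("regular checkpoint") pattern in SD/SDXL inpainting that this feedback points to looks roughly like the sketch below, adapted to SD3's flow-matching scheduler. Variable names such as `init_mask` and `image_latents` follow the snippets quoted later in this thread; this is an illustration, not this PR's actual code.

```python
# Sketch: per-step mask blending used when the model has no inpainting-specific
# input channels. After each denoising step, re-noise the original image
# latents to the noise level of `latents` and paste them back outside the mask.
init_latents_proper = image_latents
if i < len(timesteps) - 1:
    noise_timestep = timesteps[i + 1]
    init_latents_proper = self.scheduler.scale_noise(
        init_latents_proper, torch.tensor([noise_timestep]), noise
    )
# keep the unmasked region from the re-noised original image and the
# freshly denoised content inside the masked region
latents = (1 - init_mask) * init_latents_proper + init_mask * latents
```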
src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3_inpaint.py (resolved)
```python
negative_prompt_2: Optional[Union[str, List[str]]] = None,
negative_prompt_3: Optional[Union[str, List[str]]] = None,
num_images_per_prompt: Optional[int] = 1,
add_predicted_noise: Optional[bool] = False,
```
don't think we support this in any other pipelines - is it ok to remove it? what's the use case for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just removed it.
```python
noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)

# get latents
init_latents = self.scheduler.scale_noise(init_latents, timestep, noise)
```
so I think we should have the same behavior as other inpainting pipelines, where when `strength=1`, `init_latents` is pure noise:

diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py, line 716 in d5dd8df

```python
latents = noise if is_strength_max else self.scheduler.add_noise(image_latents, noise, timestep)
```
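Translated to SD3's flow-matching scheduler, the requested behavior would look roughly like this sketch (`scale_noise` takes the place of `add_noise`; names follow the quoted snippets):

```python
# Sketch: start from pure noise when strength == 1.0, otherwise noise the
# image latents up to the first retained timestep.
is_strength_max = strength == 1.0
noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
latents = noise if is_strength_max else self.scheduler.scale_noise(image_latents, timestep, noise)
```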
Thanks. I added the `is_strength_max` check in the pipeline.
```python
    init_latents_proper = self.scheduler.scale_noise(
        init_latents_orig, torch.tensor([t]), noise_pred_uncond
    )
else:
```
can we match the SD and SDXL inpainting pipelines when using the regular unet checkpoint?

diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py, line 1272 in d5dd8df

```python
if num_channels_unet == 4:
```
Thanks. I updated it to check the number of transformer channels before running the denoising process.
sorry, I wasn't clear.
When I said "can we match the SD and SDXL inpainting pipeline when using the regular unet checkpoint", I didn't mean that we need to check the number of transformer channels, although it is ok if you add the check.
SD/SDXL inpainting pipelines support both inpainting-specific checkpoints (when `num_channels_unet == 9`) and regular text-to-image checkpoints (when `num_channels_unet == 4`); I think the algorithm and overall code structure of the SD3 inpainting pipeline should match SD/SDXL inpainting very closely, but you can ignore the parts of the logic in those pipelines that only apply to inpainting-specific checkpoints.
The current implementation of this pipeline matches inpainting_legacy, which we deprecated; it is slightly different from SDXL and SD, both in code structure and in the actual algorithm.
Hi @yiyixuxu Can you review the new version? Thanks a lot!
"negative_pooled_prompt_embeds", negative_pooled_prompt_embeds | ||
) | ||
|
||
init_latents_proper = self.scheduler.scale_noise(init_latents_orig, torch.tensor([t]), noise) |
you can see the algorithm here is different from

diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py, line 1280 in 7db8c3e

e.g. `init_latents_proper` should contain the same level of noise as `latents` (`latents` here is taken after the `scheduler.step`, so technically it is `x_{t-1}`, hence we passed the next timestep to `add_noise` in SD/SDXL inpaint; in this version we just use the current timestep):

```python
noise_timestep = timesteps[i + 1]
init_latents_proper = self.scheduler.add_noise(
    init_latents_proper, noise, torch.tensor([noise_timestep])
)
```
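Adapted to the flow-matching scheduler used by SD3, the requested fix would look roughly like the sketch below (`scale_noise` in place of `add_noise`; names follow the quoted snippets):

```python
# re-noise the original latents to the level of x_{t-1}, i.e. the next timestep
if i < len(timesteps) - 1:
    noise_timestep = timesteps[i + 1]
    init_latents_proper = self.scheduler.scale_noise(
        init_latents_proper, torch.tensor([noise_timestep]), noise
    )
```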
Thanks. We updated it.
src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3_inpaint.py (resolved)
Thanks for your PR for SD3 inpainting!
Thanks for the comment. I will update it based on your idea tomorrow. It is great that you will provide a checkpoint for a 33-channel inpainting version. Would you like me to add your ideas to this PR, or create a new PR for the 33-channel version later?
@yiyixuxu @George0726 Could you review the new update? Thanks. We followed your suggestion to use the SDXL and SD inpaint pipelines, and also incorporated @George0726's idea of keeping the if-else check for 33 channels and implementing VAE "normalize" for the mask input.
@George0726 Could you try your 33-channel checkpoint in this PR? Thanks! Let me know if you run into any new errors.
@IrohXu Sure. I have tested it and modified some parts to make it work.
Here are some results from my current model. Prompt: red hair
@George0726 Thanks a lot! I have added your improved code to this PR.
@a-r-r-o-w can you give this a review too if you have time?
Thanks for your work, this looks great! Just a few small requests that need addressing
```python
def get_dummy_inputs(self, device, seed=0):
    image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
    mask_image = torch.ones((1, 1, 32, 32)).to(device)
    image = image / 2 + 0.5
```
@yiyixuxu I'm curious why we do this here and in the other SD3 tests. `floats_tensor` returns values in `[0, 1]`, so this statement makes the image tensors have values in the range `[0.5, 1]`. Shouldn't the input images be in the range `[-1, 1]` (unless I've missed something SD3-specific)?
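If the inputs should indeed be in `[-1, 1]`, the rescaling for the dummy inputs would be (a sketch of the suggested fix):

```python
# map floats_tensor's [0, 1] output to [-1, 1]
image = 2.0 * image - 1.0
```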
@yiyixuxu should I modify it now, or wait for you to create a new PR to fix all the issues in the SD3 test cases?
oh yeah, it is a mistake here.
maybe you can fix it for this test and then we fix it everywhere else in a separate PR?
Thanks! Just fixed it.
src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3_inpaint.py (resolved)
```python
num_images_per_prompt: Optional[int] = 1,
generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
latents: Optional[torch.FloatTensor] = None,
prompt_embeds: Optional[torch.FloatTensor] = None,
```
Nit that can be addressed in another PR for all SD3 and similar pipelines: the `torch.FloatTensor` hints (#7535).
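For context, the fix referenced there presumably amounts to widening the annotations from `torch.FloatTensor` to the broader `torch.Tensor`, e.g.:

```python
# the nit: prefer torch.Tensor over torch.FloatTensor in type hints
prompt_embeds: Optional[torch.Tensor] = None
```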
src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3_inpaint.py (resolved)
```python
is_strength_max = strength == 1.0

# 5. Preprocess mask and image
```
```python
    `tuple`. When returning a tuple, the first element is a list with the generated images.
"""

callback = kwargs.pop("callback", None)
```
IIRC, these callbacks were deprecated a while back, no? @yiyixuxu. We can remove them here if that's the case
yes, we don't need to accept "callback" for new pipelines!
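A sketch of the replacement mechanism: per-step hooks go through the `callback_on_step_end` API instead of the deprecated `callback`/`callback_steps`. The callback receives and returns a dict of the tensors named in `callback_on_step_end_tensor_inputs` (a `pipe` and `prompt` as elsewhere in this thread are assumed):

```python
def on_step_end(pipe, step, timestep, callback_kwargs):
    # "latents" is available because it is listed in
    # callback_on_step_end_tensor_inputs by default
    print(step, callback_kwargs["latents"].shape)
    return callback_kwargs

image = pipe(prompt, callback_on_step_end=on_step_end).images[0]
```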
```python
# call the callback, if provided
if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
    progress_bar.update()
    if callback is not None and i % callback_steps == 0:
```
I think these can be removed
@a-r-r-o-w Thanks for the comments. I have updated the code based on your suggestions. For the first issue, I think we should wait for @yiyixuxu's reply; it might be solved in another PR, since all SD3 pipelines have it.
thanks! looking great!
I think we can merge this soon
```python
callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,
callback_on_step_end_tensor_inputs: List[str] = ["latents"],
max_sequence_length: int = 256,
**kwargs,
```
Suggested change:

```diff
-    **kwargs,
```
```python
callback = kwargs.pop("callback", None)
callback_steps = kwargs.pop("callback_steps", None)
```
Suggested change:

```diff
-callback = kwargs.pop("callback", None)
-callback_steps = kwargs.pop("callback_steps", None)
+if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
+    callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
```
```python
pooled_prompt_embeds = torch.cat([negative_pooled_prompt_embeds, pooled_prompt_embeds], dim=0)

# 3. Preprocess image and mask
image = self.image_processor.preprocess(image, height, width)
```
why are we processing the image twice? we did it again in line 1008, no?
```python
if not output_type == "latent":
    condition_kwargs = {}
    if isinstance(self.vae, AsymmetricAutoencoderKL):
```
do we need this? I don't think SD3 works with this vae, no?
@yiyixuxu @a-r-r-o-w I have updated it based on your comments. Thanks a lot!
thanks!
@IrohXu can you run
@George0726 let us know when your checkpoints are ready :)
@IrohXu can you make sure the tests pass? the new inpainting tests are failing here
I have uploaded the alpha version of the SD3 inpainting model. I don't have enough data and GPUs for large-scale pre-training.
@yiyixuxu I tested it locally on my machine today, and it seems to pass all the failed cases. Do you know why this error appears differently in different test environments?
Here is my log:
I rebased the branch, can you git pull and test it again to make sure everything works?
we had this PR that may affect the inpaint pipeline: #8678
f"After adjusting the num_inference_steps by strength parameter: {strength}, the number of pipeline" | ||
f"steps is {num_inference_steps} which is < 1 and not appropriate for this pipeline." | ||
) | ||
latent_timestep = timesteps[:1].repeat(batch_size * num_inference_steps) |
Suggested change:

```diff
-latent_timestep = timesteps[:1].repeat(batch_size * num_inference_steps)
+latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt)
```
Hi, does this inpainting pipeline support LoRAs?
Hi, sadly no, it's missing the
Hi, I see that now. And is it possible to use the LoRA SD3 DreamBooth script with the inpainting checkpoint?
* Add pipeline_stable_diffusion_3_inpaint

---------

Co-authored-by: Xu Cao <xucao2@jrehg-work-01.cs.illinois.edu>
Co-authored-by: IrohXu <irohcao@gmail.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
What does this PR do?
This PR adds an inpaint pipeline for Stable Diffusion 3. It follows the mask-inpainting idea of `StableDiffusionInpaintPipeline`, as it does not need inpainting-specific weights. We hope this PR can serve as an initial version of `StableDiffusion3InpaintPipeline`; it can be replaced by a finetuned 33-channel-input inpainting weight in a later version (presumably 16 latent + 16 masked-image latent + 1 mask channels, mirroring SD/SDXL's 4 + 4 + 1 = 9). We put the demo here: DEMO.
How to use it?
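A usage sketch of the new pipeline (the checkpoint id and local file names are assumptions for illustration; the class name follows this PR):

```python
import torch
from diffusers import StableDiffusion3InpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusion3InpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",  # assumed checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = load_image("cat_on_bench.png")            # the image input below
mask_image = load_image("cat_on_bench_mask.png")  # the mask input below

prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
result = pipe(prompt=prompt, image=image, mask_image=mask_image, strength=0.8).images[0]
result.save("sd3_inpaint_output.png")
```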
Image Input:

Mask Input:

Prompt: Face of a yellow cat, high resolution, sitting on a park bench
SD3 output:

Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@yiyixuxu