deepspeed train flux1 dreambooth lora can not save model #9393
Comments
Refer to huggingface/accelerate#2787 to get an idea of the adjustments needed to make it work. |
@sayakpaul |
It would have been more helpful if you had provided more information on how you're launching the training experiments, etc. We already test whether we're able to resume training:
This I don't understand. Please elaborate so that we can provide further suggestions. |
It seems as if no --train_text_encoder is found in: My script is as follows:
@sayakpaul
pytorch_lora_weights.zip
|
Okay so, it fails for |
only fails for train_text_encoder |
Okay that is helpful. The error you posted in #9393 (comment), seems easy to solve. We should just filter out the "module" keys in the state dict and it should work. Can you try that out first? What errors do you see in the text encoder training? |
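A minimal sketch of the key filtering suggested above, assuming the saved LoRA state dict is a plain mapping and that the DeepSpeed wrapper shows up as a `module` segment in the dotted key path. The helper name `strip_module_segments` and the example keys are hypothetical and not taken from the training script:

```python
def strip_module_segments(state_dict):
    """Return a copy of state_dict whose keys have every 'module' path segment removed.

    Assumption: the wrapper contributes a literal 'module' segment to the dotted key.
    """
    cleaned = {}
    for key, value in state_dict.items():
        new_key = ".".join(part for part in key.split(".") if part != "module")
        if new_key in cleaned:
            # Two original keys collapsed to the same name; refuse to overwrite silently.
            raise KeyError(f"duplicate key after filtering: {new_key}")
        cleaned[new_key] = value
    return cleaned


# Hypothetical keys, for illustration only.
example = {
    "transformer.module.single_transformer_blocks.0.attn.to_q.lora_A.weight": 1,
    "transformer.single_transformer_blocks.0.attn.to_k.lora_A.weight": 2,
}
cleaned = strip_module_segments(example)
```

Dropping the segment wherever it occurs is a deliberately blunt choice; if a model legitimately has a submodule named `module`, a narrower rule (for example, only removing a leading `module.` prefix) would be safer.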
Oh that I am not sure about then. Ccing @muellerzr for advice. |
With the LoRA trained under DeepSpeed, I filtered out "module" from the keys, and it works the same as one trained without DeepSpeed:
|
Yeah of course that is why I suggested. Usually, you would want to always call
and here
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Is this still a problem? |
no problem |
same error, and i have tried this modification
but it does not work |
Describe the bug
When I run the script train_dreambooth_lora_flux.py, it raises ValueError: unexpected save model: <class 'deepspeed.runtime.engine.DeepSpeedEngine'>. Is there a bug in save_model_hook?
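To illustrate why a type check inside a save hook can fail on a wrapped model, here is a minimal sketch. It assumes the hook compares each incoming model against the type of the model being trained, and that the wrapper exposes the original model as `.module`. `FakeTransformer`, `FakeEngine`, `unwrap`, and `select_models_to_save` are hypothetical stand-ins; in the real script the unwrapping would presumably go through `accelerator.unwrap_model`:

```python
class FakeTransformer:
    """Hypothetical stand-in for the model being trained."""


class FakeEngine:
    """Hypothetical stand-in for a wrapper that keeps the real model in .module."""

    def __init__(self, module):
        self.module = module


def unwrap(model):
    """Peel off wrapper layers by following .module until none is left."""
    while hasattr(model, "module"):
        model = model.module
    return model


def select_models_to_save(models, transformer):
    """Return the unwrapped models whose type matches the unwrapped transformer."""
    expected_type = type(unwrap(transformer))
    selected = []
    for model in models:
        inner = unwrap(model)  # compare the inner model, not the wrapper
        if isinstance(inner, expected_type):
            selected.append(inner)
        else:
            raise ValueError(f"unexpected save model: {model.__class__}")
    return selected


transformer = FakeTransformer()
saved = select_models_to_save([FakeEngine(transformer)], transformer)
```

Without the `unwrap(model)` call on the loop variable, the `isinstance` check would see `FakeEngine` rather than `FakeTransformer` and fall through to the `ValueError`, which mirrors the error reported here.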
Reproduction
accelerate launch train_dreambooth_lora_flux_custom.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="bf16" \
  --instance_prompt="bedroom, YF_CN style" \
  --resolution=1024 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=4 \
  --optimizer="prodigy" \
  --learning_rate=1. \
  --report_to="tensorboard" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_train_epochs=30 \
  --validation_prompt="bedroom, YF_CN style" \
  --validation_epochs=80 \
  --checkpointing_steps=500 \
  --seed="0" \
  --gradient_checkpointing \
  --use_8bit_adam \
  --rank=4
Logs
No response
System Info
torch==2.3.1
accelerate==0.34.2
deepspeed==0.15.1+8ac42ed7
diffusers==0.31.0.dev0
default_config.yaml is as follows:
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Who can help?
@sayakpaul