
Finetune T5 11B and the process is killed, exits with return code = -9 [BUG] #2946

Closed
zhilizju opened this issue Mar 5, 2023 · 20 comments
Labels: bug, training

zhilizju commented Mar 5, 2023

Describe the bug
Hi, I want to finetune the T5 (11B) model, but the process is killed and exits with return code = -9.

[2023-03-05 06:40:25,173] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698707
[2023-03-05 06:40:27,249] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698708
[2023-03-05 06:40:27,250] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698709
[2023-03-05 06:40:28,626] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698710
[2023-03-05 06:40:30,045] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698711
[2023-03-05 06:40:31,498] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698712
[2023-03-05 06:40:32,912] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698713
[2023-03-05 06:40:34,370] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698714
[2023-03-05 06:40:35,705] [ERROR] [launch.py:324:sigkill_handler] ['/home/lizhi/anaconda3/envs/tk-instruct/bin/python', '-u', 'src/run_s2s.py', '--local_rank=7', '--do_train', '--do_predict', '--predict_with_generate', '--model_name_or_path', '/home/lizhi/Tk-Instruct-main/google/t5-xxl-lm-adapt', '--max_source_length', '1024', '--max_target_length', '128', '--generation_max_length', '128', '--max_num_instances_per_task', '1', '--max_num_instances_per_eval_task', '1', '--add_task_name', 'False', '--add_task_definition', 'True', '--num_pos_examples', '2', '--num_neg_examples', '0', '--add_explanation', 'False', '--tk_instruct', 'False', '--data_dir', 'data/splits/default', '--task_dir', 'data/tasks', '--output_dir', 'output/', '--overwrite_output_dir', '--cache_dir', './cache/', '--overwrite_cache', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--learning_rate', '5e-05', '--num_train_epochs', '1', '--lr_scheduler_type', 'constant', '--warmup_steps', '0', '--logging_strategy', 'steps', '--logging_steps', '500', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '2500', '--deepspeed', 'ds_configs/11b_stage3_offload.config', '--bf16', '--run_name', 't5-experiment'] exits with return code = -9

To Reproduce
Steps to reproduce the behavior:

The code is based on the project https://github.com/yizhongw/Tk-Instruct.
Just add a new config (name it 11b_stage3_offload.config) under the folder ds_configs.
The content of the new config is:

{
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": false
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": false
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": false
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Then modify two parameters of the script https://github.com/yizhongw/Tk-Instruct/blob/main/scripts/train_tk_instruct.sh: replace the
--model_name_or_path google/t5-xl-lm-adapt
--deepspeed ds_configs/11b_stage3_offload.config
with
--model_name_or_path google/t5-xxl-lm-adapt
--deepspeed ds_configs/11b_stage3_offload.config

and add --bf16

Expected behavior
I hope I can finetune this model.

ds_report output
[screenshot: ds_report output]

Screenshots
[screenshot: error output]

System info:

  • OS: [e.g. Ubuntu 20.04]
  • GPU count and types: one machine with 8x RTX 6000, 48 GB each
  • Python version: 3.8.16

Launcher context
#!/bin/bash
set -x

export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export TRANSFORMERS_CACHE=/home/lizhi/.cache/huggingface

port=$(shuf -i25000-30000 -n1)

deepspeed --master_port $port src/run_s2s.py \
    --do_train \
    --do_predict \
    --predict_with_generate \
    --model_name_or_path google/t5-xxl-lm-adapt \
    --max_source_length 1024 \
    --max_target_length 128 \
    --generation_max_length 128 \
    --max_num_instances_per_task 1 \
    --max_num_instances_per_eval_task 1 \
    --add_task_name False \
    --add_task_definition True \
    --num_pos_examples 2 \
    --num_neg_examples 0 \
    --add_explanation False \
    --tk_instruct False \
    --data_dir data/splits/default \
    --task_dir data/tasks \
    --output_dir output/ \
    --overwrite_output_dir \
    --cache_dir ./cache/ \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 5e-05 \
    --num_train_epochs 1 \
    --lr_scheduler_type constant \
    --warmup_steps 0 \
    --logging_strategy steps \
    --logging_steps 500 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 2500 \
    --deepspeed ds_configs/11b_stage3_offload.config \
    --bf16 \
    --run_name t5-experiment


zhilizju commented Mar 5, 2023

I installed two packages; the new ds_report output:
[screenshot: updated ds_report output]

I find that when I start to run this script, the CPU memory fills up gradually (available memory drops from 332706 to 0).
[screenshot: available memory dropping]

Does this mean we can't train the model even with offload? But I found that the project https://github.com/philschmid/deep-learning-pytorch-huggingface has a similar 8-GPU setup and successfully trains an 11B T5 model.
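As an aside, a small monitoring sketch (my own, not part of Tk-Instruct; it assumes the psutil package is installed) that can be run next to the launcher to log exactly when host memory runs out:

```python
# Log available host (CPU) memory every few seconds while the training
# launcher runs, to confirm where it is exhausted. Hypothetical helper,
# not part of the Tk-Instruct or DeepSpeed code bases.
import time

import psutil  # assumption: installed via `pip install psutil`

while True:
    avail_mb = psutil.virtual_memory().available // (1024 * 1024)
    print(f"available CPU memory: {avail_mb} MB", flush=True)
    time.sleep(5)
```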


zhilizju commented Mar 5, 2023

Any help would be appreciated @tjruwase @stas00


lambda7xx commented Mar 5, 2023

> Any help would be appreciated @tjruwase @stas00

It seems this is OOM. Your memory usage is 513497.


zhilizju commented Mar 5, 2023

> Any help would be appreciated @tjruwase @stas00
>
> It seems this is OOM. Your memory usage is 513497.

Yes, I know. What I want to know is: with 332706 MB of CPU memory and 8 x 48 GB GPUs, can't we finetune an 11B T5 model? If we can, is something wrong with my config, or something else? Thank you @lambda7xx

@lambda7xx

> Any help would be appreciated @tjruwase @stas00
>
> It seems this is OOM. Your memory usage is 513497.
>
> Yes, I know. What I want to know is: with 332706 MB of CPU memory and 8 x 48 GB GPUs, can't we finetune an 11B T5 model? If we can, is something wrong with my config, or something else? Thank you @lambda7xx

You mean training an 11B model on your 8 x 48 GB of GPU memory? I think that's enough even if you don't use ZeRO. I tried training a 15B model on 8 x 32 GB GPUs and it was OK.


zhilizju commented Mar 5, 2023

Yes, it should be enough. But I don't know why it doesn't work; this is why I opened this issue. I also tried it without ZeRO and it also raises the -9 error. See the issue yizhongw/Tk-Instruct#22 (comment).


zhilizju commented Mar 5, 2023

> Any help would be appreciated @tjruwase @stas00
>
> It seems this is OOM. Your memory usage is 513497.
>
> Yes, I know. What I want to know is: with 332706 MB of CPU memory and 8 x 48 GB GPUs, can't we finetune an 11B T5 model? If we can, is something wrong with my config, or something else? Thank you @lambda7xx
>
> You mean training an 11B model on your 8 x 48 GB of GPU memory? I think that's enough even if you don't use ZeRO. I tried training a 15B model on 8 x 32 GB GPUs and it was OK.

Would you like to share your 15B DeepSpeed config? I have sent you an email.

@lambda7xx

I do not use DeepSpeed to run the 15B model. I use Alpa to run the 15B model on 32 GPUs.


zhilizju commented Mar 5, 2023

> I do not use DeepSpeed to run the 15B model. I use Alpa to run the 15B model on 32 GPUs.

Anyway, thanks!


zhilizju commented Mar 6, 2023

Still need help. @tjruwase @stas00


tjruwase commented Mar 6, 2023

@zhilizju, can you try disabling offloading by removing offload_param and offload_optimizer from your ds_config? It seems that with ZeRO stage 3 you should have enough GPU memory (across the 8 GPUs) to at least initialize the model. Please share a stack trace of this run. Thanks!
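For reference, a minimal sketch of one way to produce the no-offload variant of the config this comment suggests (the output filename is only an illustration):

```python
# Strip the two CPU-offload sections from the existing DeepSpeed config and
# write a no-offload variant; ZeRO-3 then keeps states partitioned on the GPUs.
import json

with open("ds_configs/11b_stage3_offload.config") as f:
    cfg = json.load(f)

cfg["zero_optimization"].pop("offload_optimizer", None)
cfg["zero_optimization"].pop("offload_param", None)

# Hypothetical output path; point --deepspeed at whatever name you choose.
with open("ds_configs/11b_stage3_no_offload.config", "w") as f:
    json.dump(cfg, f, indent=2)
```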


zhilizju commented Mar 6, 2023

> @zhilizju, can you try disabling offloading by removing offload_param and offload_optimizer from your ds_config? It seems that with ZeRO stage 3 you should have enough GPU memory (across the 8 GPUs) to at least initialize the model. Please share a stack trace of this run. Thanks!

Amazing! It works! Thanks a lot!
But it still confuses me. I find that during the initialization of this model, CPU usage increases from 0 to 330 GB. It takes almost all the available CPU memory. Is that reasonable? The 11B model is only 40+ GB. If I use offload_param and offload_optimizer, then the CPU memory is exhausted.
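A rough back-of-the-envelope sketch of the host memory that ZeRO-3 CPU offload keeps resident for an ~11B-parameter model (my own arithmetic, assuming the AdamW optimizer with fp32 states; it does not account for the per-rank initialization overhead investigated later in this thread):

```python
# Approximate host-memory footprint of CPU-offloaded optimizer state for ~11B params.
params = 11e9  # approximate parameter count of t5-xxl-lm-adapt

fp32_master_weights = params * 4  # bytes: fp32 copy of the weights held by the optimizer
adam_first_moment = params * 4    # bytes: Adam/AdamW momentum, fp32
adam_second_moment = params * 4   # bytes: Adam/AdamW variance, fp32

offloaded_gb = (fp32_master_weights + adam_first_moment + adam_second_moment) / 1e9
print(f"optimizer state offloaded to CPU: ~{offloaded_gb:.0f} GB")  # ~132 GB
# On top of this come offloaded bf16 parameters, gradient staging buffers, and any
# transient per-rank copies made while the checkpoint is loaded and partitioned.
```

So even before any initialization-time duplication, offloading pins roughly three times the size of the fp32 checkpoint in host RAM.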


tjruwase commented Mar 6, 2023

I am glad that worked for you. I hope you are able to fit a reasonable batch size for your finetuning.

In terms of the high CPU usage, I have suspicions about the cause, which I am trying to address with #2953. Unfortunately, I am not yet able to reproduce the problem on my side. If you have the interest or bandwidth to help, I can share an instrumented branch with you for more detailed profiling. But the important thing is to unblock you.


zhilizju commented Mar 7, 2023

> I am glad that worked for you. I hope you are able to fit a reasonable batch size for your finetuning.
>
> In terms of the high CPU usage, I have suspicions about the cause, which I am trying to address with #2953. Unfortunately, I am not yet able to reproduce the problem on my side. If you have the interest or bandwidth to help, I can share an instrumented branch with you for more detailed profiling. But the important thing is to unblock you.

Thanks for your kind reply! I am interested in helping to address this issue; I think it is important. But I am a new DeepSpeed user and may not be able to help much.


xinj7 commented Mar 11, 2023

> @zhilizju, can you try disabling offloading by removing offload_param and offload_optimizer from your ds_config? It seems that with ZeRO stage 3 you should have enough GPU memory (across the 8 GPUs) to at least initialize the model. Please share a stack trace of this run. Thanks!
>
> Amazing! It works! Thanks a lot! But it still confuses me. I find that during the initialization of this model, CPU usage increases from 0 to 330 GB. It takes almost all the available CPU memory. Is that reasonable? The 11B model is only 40+ GB. If I use offload_param and offload_optimizer, then the CPU memory is exhausted.

Hi, did you manage to run the whole process? For me, I was able to initialize the model, but as soon as it reaches training an OOM error appears. I'm training a 13B model on a machine with 8x48 GB GPUs, without any offloading.


stas00 commented Mar 11, 2023

Enable gradient checkpointing to free up a ton of GPU memory; see #2797 (comment).

In some cases this allows you to double or quadruple the batch size if you were already able to run a small batch size without OOM.
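For readers landing here, a minimal sketch of how gradient checkpointing is typically turned on with Hugging Face Transformers (whether run_s2s.py already exposes a flag for this, and the exact transformers version in use, are assumptions; recent versions also accept gradient_checkpointing=True in TrainingArguments):

```python
# Enable gradient (activation) checkpointing on the model before handing it to the trainer.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-xxl-lm-adapt")
model.gradient_checkpointing_enable()  # recompute activations in backward to save GPU memory
model.config.use_cache = False         # the generation cache is incompatible with checkpointing during training
```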


xinj7 commented Mar 11, 2023

> Enable gradient checkpointing to free up a ton of GPU memory; see #2797 (comment).
>
> In some cases this allows you to double or quadruple the batch size if you were already able to run a small batch size without OOM.

Thanks for the suggestion, but I'm already using gradient checkpointing in the trainer. I wonder if it's just impossible to train a 13B model in fp32 (the GPUs don't support bf16) on a machine with 8x48 GB GPUs without offloading (I have 150 GB of available CPU RAM, so no offloading).

Setting:
[screenshot: training settings]

OOM error:
[screenshot: OOM error]


stas00 commented Mar 11, 2023

Your problem is different from this issue and should be dealt with separately.

Could you please open a new issue where you specify all the details of your setup? For example, you're missing your ds config file, and you're not showing your command line or argument values, so it's very difficult for me to see the full picture.

There are other solutions for saving GPU memory, e.g. using BNB's 8-bit optimizer (huggingface/transformers#15622), though I haven't tried it with DeepSpeed. But let's discuss that there.

Please tag me on it.

@zhilizju

> @zhilizju, can you try disabling offloading by removing offload_param and offload_optimizer from your ds_config? It seems that with ZeRO stage 3 you should have enough GPU memory (across the 8 GPUs) to at least initialize the model. Please share a stack trace of this run. Thanks!
>
> Amazing! It works! Thanks a lot! But it still confuses me. I find that during the initialization of this model, CPU usage increases from 0 to 330 GB. It takes almost all the available CPU memory. Is that reasonable? The 11B model is only 40+ GB. If I use offload_param and offload_optimizer, then the CPU memory is exhausted.
>
> Hi, did you manage to run the whole process? For me, I was able to initialize the model, but as soon as it reaches training an OOM error appears. I'm training a 13B model on a machine with 8x48 GB GPUs, without any offloading.

I use bf16 rather than fp32. My batch size on each GPU is 1, and increasing it leads to OOM. (And I suspect that a model in fp32 may be difficult to train.) But I can increase --gradient_accumulation_steps to 8, so the total batch size can be 64. Hope this information helps you; stas00 also gives some good suggestions worth trying. Best wishes!
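For concreteness, the effective batch size arithmetic behind the previous comment (per-device batch size and GPU count taken from this thread):

```python
# Global batch size = per-device batch size x number of GPUs x gradient accumulation steps.
per_device_train_batch_size = 1
num_gpus = 8
gradient_accumulation_steps = 8

effective_batch_size = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
print(effective_batch_size)  # 64
```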

@tjruwase

@zhilizju, is it okay to close this issue since the original problem is resolved?
