Finetune T5 11B and the process is killed, exits with return code = -9 [BUG] #2946
Comments
I installed the two packages, and here is the new ds_report output. I find that when I start running this script, CPU memory fills up gradually (available memory drops from 332706 MB to 0). Does this mean we can't train the model even with offload? But I found this project https://github.com/philschmid/deep-learning-pytorch-huggingface, which uses a similar 8-GPU setup and successfully trains an 11B T5 model.
Yes, I know. What I want to know is: with 332706 MB of CPU memory and 8 x 48 GB of GPU memory, can't we finetune an 11B T5 model? If we can, what's wrong with my config or something else? Thank you @lambda7xx
You mean training an 11B model with your 8 x 48 GB of GPU memory? I think that's enough even if you don't use ZeRO. I have tried training a 15B model on 8 x 32 GB GPUs, and it was OK.
Yes, it should be enough, but I don't know why it doesn't work. That's why I opened this issue. I also tried it without ZeRO and it also raised the -9 error. See the issue yizhongw/Tk-Instruct#22 (comment).
Would you like to share your 15B DeepSpeed config? I have sent you an email.
I did not use DeepSpeed to run the 15B model. I used Alpa to run the 15B model on 32 GPUs.
Anyway, thanks!
@zhilizju, can you try disabling offloading by removing the offload_optimizer and offload_param sections from your config?
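For reference, a sketch of what the zero_optimization section of 11b_stage3_offload.config would look like with both offload blocks removed (everything else in the config stays the same):
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": false
},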
Amazing! It works! Thanks a lot!
I am glad that worked for you. I hope you are able to fit a reasonable batch size for your finetuning. In terms of the high CPU usage, I have suspicions about the cause, which I am trying to address with #2953. Unfortunately, I am not yet able to reproduce the problem on my side. If you have the interest or bandwidth to help, I can share an instrumented branch with you for more detailed profiling. But the important thing is to unblock you.
Thanks for your kind reply! I am interested in helping to address this issue; I think it is important. But I am a new DeepSpeed user and may not be able to help much.
Hi, did you manage to run the whole process? For me, I was able to initialize the model, but as soon as it reaches training, the OOM error appears. I'm training a 13B model on an 8 x 48 GB machine, without any offloading.
Enable gradient checkpointing to liberate a ton of GPU memory, see #2797 (comment). In some cases this allows you to double or quadruple the batch size if you were already able to run a small batch size without OOM.
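For example, with the Hugging Face Trainer that run_s2s.py builds on, gradient checkpointing can typically be switched on by adding a single flag to the launch command (a sketch, assuming a transformers version that exposes --gradient_checkpointing as a training argument; the remaining arguments from train_tk_instruct.sh are omitted here for brevity):
deepspeed --master_port $port src/run_s2s.py \
--do_train \
--model_name_or_path google/t5-xxl-lm-adapt \
--output_dir output/ \
--per_device_train_batch_size 1 \
--gradient_checkpointing \
--bf16 \
--deepspeed ds_configs/11b_stage3_offload.config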
Thanks for the suggestion, but I'm already using gradient checkpointing in the trainer. I wonder if it's just impossible to train a 13B model in fp32 (the GPU doesn't support bf16) on an 8 x 48 GB machine without offloading (I only have 150 GB of available CPU RAM, so no offloading).
Your question is different from this issue and should be dealt with separately. Could you please open a new issue where you specify all the details of your setup? For example, you're missing your ds config file and you are not showing your command line or the values of your args, so it's very difficult for me to see the full picture. There are other solutions for saving GPU memory, e.g. using BNB's 8-bit optimizer (huggingface/transformers#15622), though I haven't tried it with DeepSpeed. But let's discuss it there; please tag me on it.
I use bf16 rather than fp32. My batch size on each GPU is 1, and increasing it leads to OOM. (I also suspect the model may be difficult to train in fp32.) But I can increase --gradient_accumulation_steps to 8, so the effective total batch size is 64 (1 per GPU x 8 GPUs x 8 accumulation steps). Hope this information helps you; stas00 also gives some good suggestions worth trying. Best wishes!
@zhilizju, is it okay to close this issue since the original problem is resolved?
Describe the bug
Hi, I want to finetune a T5 model (11B), but the process is killed and exits with return code = -9.
[2023-03-05 06:40:25,173] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698707
[2023-03-05 06:40:27,249] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698708
[2023-03-05 06:40:27,250] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698709
[2023-03-05 06:40:28,626] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698710
[2023-03-05 06:40:30,045] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698711
[2023-03-05 06:40:31,498] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698712
[2023-03-05 06:40:32,912] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698713
[2023-03-05 06:40:34,370] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698714
[2023-03-05 06:40:35,705] [ERROR] [launch.py:324:sigkill_handler] ['/home/lizhi/anaconda3/envs/tk-instruct/bin/python', '-u', 'src/run_s2s.py', '--local_rank=7', '--do_train', '--do_predict', '--predict_with_generate', '--model_name_or_path', '/home/lizhi/Tk-Instruct-main/google/t5-xxl-lm-adapt', '--max_source_length', '1024', '--max_target_length', '128', '--generation_max_length', '128', '--max_num_instances_per_task', '1', '--max_num_instances_per_eval_task', '1', '--add_task_name', 'False', '--add_task_definition', 'True', '--num_pos_examples', '2', '--num_neg_examples', '0', '--add_explanation', 'False', '--tk_instruct', 'False', '--data_dir', 'data/splits/default', '--task_dir', 'data/tasks', '--output_dir', 'output/', '--overwrite_output_dir', '--cache_dir', './cache/', '--overwrite_cache', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--learning_rate', '5e-05', '--num_train_epochs', '1', '--lr_scheduler_type', 'constant', '--warmup_steps', '0', '--logging_strategy', 'steps', '--logging_steps', '500', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '2500', '--deepspeed', 'ds_configs/11b_stage3_offload.config', '--bf16', '--run_name', 't5-experiment'] exits with return code = -9
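Exit code -9 indicates the worker processes were terminated with SIGKILL, which on Linux is usually the kernel OOM killer stepping in when host (CPU) memory is exhausted. Assuming access to the host's kernel log, one way to confirm this is:
dmesg -T | grep -iE "out of memory|killed process"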
To Reproduce
Steps to reproduce the behavior:
The code is based on the project https://github.com/yizhongw/Tk-Instruct
Just add a new config (name it 11b_stage3_offload.config) under the folder ds_configs.
The content of the new config is:
{
"bf16": {
"enabled": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": false
},
"offload_param": {
"device": "cpu",
"pin_memory": false
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": false
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
Then modify the two parameters of the script https://github.com/yizhongw/Tk-Instruct/blob/main/scripts/train_tk_instruct.sh: replace
--model_name_or_path google/t5-xl-lm-adapt
--deepspeed ds_configs/11b_stage3_offload.config
with
--model_name_or_path google/t5-xxl-lm-adapt
--deepspeed ds_configs/11b_stage3_offload.config
and add --bf16.
Expected behavior
I hope I can finetune this model.
ds_report output
Screenshots
System info (please complete the following information):
Launcher context
#!/bin/bash
set -x
export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export TRANSFORMERS_CACHE=/home/lizhi/.cache/huggingface
port=$(shuf -i25000-30000 -n1)
deepspeed --master_port $port src/run_s2s.py \
--do_train \
--do_predict \
--predict_with_generate \
--model_name_or_path google/t5-xxl-lm-adapt \
--max_source_length 1024 \
--max_target_length 128 \
--generation_max_length 128 \
--max_num_instances_per_task 1 \
--max_num_instances_per_eval_task 1 \
--add_task_name False \
--add_task_definition True \
--num_pos_examples 2 \
--num_neg_examples 0 \
--add_explanation False \
--tk_instruct False \
--data_dir data/splits/default \
--task_dir data/tasks \
--output_dir output/ \
--overwrite_output_dir \
--cache_dir ./cache/ \
--overwrite_cache \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 5e-05 \
--num_train_epochs 1 \
--lr_scheduler_type constant \
--warmup_steps 0 \
--logging_strategy steps \
--logging_steps 500 \
--evaluation_strategy no \
--save_strategy steps \
--save_steps 2500 \
--deepspeed ds_configs/11b_stage3_offload.config \
--bf16 \
--run_name t5-experiment