
Finetune T5 11B and the process is killed, exits with return code = -9 [BUG] #2946

Closed
zhilizju opened this issue Mar 5, 2023 · 20 comments
Labels: bug, training

zhilizju commented Mar 5, 2023

Describe the bug
Hi, I want to finetune the T5 (11B) model, but the process is killed and exits with return code = -9.

[2023-03-05 06:40:25,173] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698707
[2023-03-05 06:40:27,249] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698708
[2023-03-05 06:40:27,250] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698709
[2023-03-05 06:40:28,626] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698710
[2023-03-05 06:40:30,045] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698711
[2023-03-05 06:40:31,498] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698712
[2023-03-05 06:40:32,912] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698713
[2023-03-05 06:40:34,370] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3698714
[2023-03-05 06:40:35,705] [ERROR] [launch.py:324:sigkill_handler] ['/home/lizhi/anaconda3/envs/tk-instruct/bin/python', '-u', 'src/run_s2s.py', '--local_rank=7', '--do_train', '--do_predict', '--predict_with_generate', '--model_name_or_path', '/home/lizhi/Tk-Instruct-main/google/t5-xxl-lm-adapt', '--max_source_length', '1024', '--max_target_length', '128', '--generation_max_length', '128', '--max_num_instances_per_task', '1', '--max_num_instances_per_eval_task', '1', '--add_task_name', 'False', '--add_task_definition', 'True', '--num_pos_examples', '2', '--num_neg_examples', '0', '--add_explanation', 'False', '--tk_instruct', 'False', '--data_dir', 'data/splits/default', '--task_dir', 'data/tasks', '--output_dir', 'output/', '--overwrite_output_dir', '--cache_dir', './cache/', '--overwrite_cache', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--learning_rate', '5e-05', '--num_train_epochs', '1', '--lr_scheduler_type', 'constant', '--warmup_steps', '0', '--logging_strategy', 'steps', '--logging_steps', '500', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '2500', '--deepspeed', 'ds_configs/11b_stage3_offload.config', '--bf16', '--run_name', 't5-experiment'] exits with return code = -9

To Reproduce
Steps to reproduce the behavior:

The code is based on the project https://github.com/yizhongw/Tk-Instruct.
Just add a new config (name it 11b_stage3_offload.config) under the folder ds_configs.
The content of the new config is:

{
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": false
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": false
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": false
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Then modify two parameters of the script https://github.com/yizhongw/Tk-Instruct/blob/main/scripts/train_tk_instruct.sh: replace the
--model_name_or_path google/t5-xl-lm-adapt
--deepspeed ds_configs/11b_stage3_offload.config
with
--model_name_or_path google/t5-xxl-lm-adapt
--deepspeed ds_configs/11b_stage3_offload.config

and add --bf16

Expected behavior
I hope I can finetune this model.

ds_report output
[screenshot: ds_report output]

Screenshots
[screenshot: error output]

System info:

  • OS: [e.g. Ubuntu 20.04]
  • GPU count and types: one machine with 8x RTX 6000, 48 GB each
  • Python version: 3.8.16

Launcher context
#!/bin/bash
set -x

export CUDA_DEVICE_ORDER="PCI_BUS_ID"
export TRANSFORMERS_CACHE=/home/lizhi/.cache/huggingface

port=$(shuf -i25000-30000 -n1)

deepspeed --master_port $port src/run_s2s.py \
    --do_train \
    --do_predict \
    --predict_with_generate \
    --model_name_or_path google/t5-xxl-lm-adapt \
    --max_source_length 1024 \
    --max_target_length 128 \
    --generation_max_length 128 \
    --max_num_instances_per_task 1 \
    --max_num_instances_per_eval_task 1 \
    --add_task_name False \
    --add_task_definition True \
    --num_pos_examples 2 \
    --num_neg_examples 0 \
    --add_explanation False \
    --tk_instruct False \
    --data_dir data/splits/default \
    --task_dir data/tasks \
    --output_dir output/ \
    --overwrite_output_dir \
    --cache_dir ./cache/ \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 5e-05 \
    --num_train_epochs 1 \
    --lr_scheduler_type constant \
    --warmup_steps 0 \
    --logging_strategy steps \
    --logging_steps 500 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 2500 \
    --deepspeed ds_configs/11b_stage3_offload.config \
    --bf16 \
    --run_name t5-experiment


zhilizju commented Mar 5, 2023

I installed two packages; the new ds_report output:
[screenshot: updated ds_report output]

I find that when I start to run this script, the CPU memory fills up gradually (available memory drops from 332706 to 0).
[screenshot: available memory dropping]

Does this mean we can't train the model even with offload? But I found that the project https://github.com/philschmid/deep-learning-pytorch-huggingface has a similar 8-GPU setup and successfully trains an 11B T5 model.
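As an aside, a small monitoring sketch (my own, not part of Tk-Instruct; it assumes the psutil package is installed) that can be run next to the launcher to log exactly when host memory runs out:

```python
# Log available host (CPU) memory every few seconds while the training
# launcher runs, to confirm where it is exhausted. Hypothetical helper,
# not part of the Tk-Instruct or DeepSpeed code bases.
import time

import psutil  # assumption: installed via `pip install psutil`

while True:
    avail_mb = psutil.virtual_memory().available // (1024 * 1024)
    print(f"available CPU memory: {avail_mb} MB", flush=True)
    time.sleep(5)
```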


zhilizju commented Mar 5, 2023

Any help would be appreciated @tjruwase @stas00


lambda7xx commented Mar 5, 2023

> Any help would be appreciated @tjruwase @stas00

It seems this is OOM. Your memory usage is 513497.


zhilizju commented Mar 5, 2023

> Any help would be appreciated @tjruwase @stas00
>
> It seems this is OOM. Your memory usage is 513497.

Yes, I know. What I want to know is: with 332706 MB of CPU memory and 8 x 48 GB GPUs, can't we finetune an 11B T5 model? If we can, is something wrong with my config, or something else? Thank you @lambda7xx

@lambda7xx

> Any help would be appreciated @tjruwase @stas00
>
> It seems this is OOM. Your memory usage is 513497.
>
> Yes, I know. What I want to know is: with 332706 MB of CPU memory and 8 x 48 GB GPUs, can't we finetune an 11B T5 model? If we can, is something wrong with my config, or something else? Thank you @lambda7xx

You mean training an 11B model on your 8 x 48 GB of GPU memory? I think that's enough even if you don't use ZeRO. I tried training a 15B model on 8 x 32 GB GPUs and it was OK.


zhilizju commented Mar 5, 2023

Yes, it should be enough. But I don't know why it doesn't work; this is why I opened this issue. I also tried it without ZeRO and it also raises the -9 error. See the issue yizhongw/Tk-Instruct#22 (comment).


zhilizju commented Mar 5, 2023

> Any help would be appreciated @tjruwase @stas00
>
> It seems this is OOM. Your memory usage is 513497.
>
> Yes, I know. What I want to know is: with 332706 MB of CPU memory and 8 x 48 GB GPUs, can't we finetune an 11B T5 model? If we can, is something wrong with my config, or something else? Thank you @lambda7xx
>
> You mean training an 11B model on your 8 x 48 GB of GPU memory? I think that's enough even if you don't use ZeRO. I tried training a 15B model on 8 x 32 GB GPUs and it was OK.

Would you like to share your 15B DeepSpeed config? I have sent you an email.

@lambda7xx

I do not use DeepSpeed to run the 15B model. I use Alpa to run the 15B model on 32 GPUs.


zhilizju commented Mar 5, 2023

> I do not use DeepSpeed to run the 15B model. I use Alpa to run the 15B model on 32 GPUs.

Anyway, thanks!


zhilizju commented Mar 6, 2023

Still need help. @tjruwase @stas00


tjruwase commented Mar 6, 2023

@zhilizju, can you try disabling offloading by removing offload_param and offload_optimizer from your ds_config? It seems that with ZeRO stage 3 you should have enough GPU memory (across the 8 GPUs) to at least initialize the model. Please share a stack trace of this run. Thanks!
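For reference, a minimal sketch of one way to produce the no-offload variant of the config this comment suggests (the output filename is only an illustration):

```python
# Strip the two CPU-offload sections from the existing DeepSpeed config and
# write a no-offload variant; ZeRO-3 then keeps states partitioned on the GPUs.
import json

with open("ds_configs/11b_stage3_offload.config") as f:
    cfg = json.load(f)

cfg["zero_optimization"].pop("offload_optimizer", None)
cfg["zero_optimization"].pop("offload_param", None)

# Hypothetical output path; point --deepspeed at whatever name you choose.
with open("ds_configs/11b_stage3_no_offload.config", "w") as f:
    json.dump(cfg, f, indent=2)
```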


zhilizju commented Mar 6, 2023

> @zhilizju, can you try disabling offloading by removing offload_param and offload_optimizer from your ds_config? It seems that with ZeRO stage 3 you should have enough GPU memory (across the 8 GPUs) to at least initialize the model. Please share a stack trace of this run. Thanks!

Amazing! It works! Thanks a lot!
But it still confuses me. I find that during the initialization of this model, CPU usage increases from 0 to 330 GB. It takes almost all the available CPU memory. Is that reasonable? The 11B model is only 40+ GB. If I use offload_param and offload_optimizer, then the CPU memory is exhausted.
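A rough back-of-the-envelope sketch of the host memory that ZeRO-3 CPU offload keeps resident for an ~11B-parameter model (my own arithmetic, assuming the AdamW optimizer with fp32 states; it does not account for the per-rank initialization overhead investigated later in this thread):

```python
# Approximate host-memory footprint of CPU-offloaded optimizer state for ~11B params.
params = 11e9  # approximate parameter count of t5-xxl-lm-adapt

fp32_master_weights = params * 4  # bytes: fp32 copy of the weights held by the optimizer
adam_first_moment = params * 4    # bytes: Adam/AdamW momentum, fp32
adam_second_moment = params * 4   # bytes: Adam/AdamW variance, fp32

offloaded_gb = (fp32_master_weights + adam_first_moment + adam_second_moment) / 1e9
print(f"optimizer state offloaded to CPU: ~{offloaded_gb:.0f} GB")  # ~132 GB
# On top of this come offloaded bf16 parameters, gradient staging buffers, and any
# transient per-rank copies made while the checkpoint is loaded and partitioned.
```

So even before any initialization-time duplication, offloading pins roughly three times the size of the fp32 checkpoint in host RAM.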


tjruwase commented Mar 6, 2023

I am glad that worked for you. I hope you are able to fit a reasonable batch size for your finetuning.

In terms of the high CPU usage, I have suspicions about the cause, which I am trying to address with #2953. Unfortunately, I am not yet able to reproduce the problem on my side. If you have the interest or bandwidth to help, I can share an instrumented branch with you for more detailed profiling. But the important thing is to unblock you.


zhilizju commented Mar 7, 2023

> I am glad that worked for you. I hope you are able to fit a reasonable batch size for your finetuning.
>
> In terms of the high CPU usage, I have suspicions about the cause, which I am trying to address with #2953. Unfortunately, I am not yet able to reproduce the problem on my side. If you have the interest or bandwidth to help, I can share an instrumented branch with you for more detailed profiling. But the important thing is to unblock you.

Thanks for your kind reply! I am interested in helping to address this issue; I think it is important. But I am a new DeepSpeed user and may not be able to help much.


xinj7 commented Mar 11, 2023

> @zhilizju, can you try disabling offloading by removing offload_param and offload_optimizer from your ds_config? It seems that with ZeRO stage 3 you should have enough GPU memory (across the 8 GPUs) to at least initialize the model. Please share a stack trace of this run. Thanks!
>
> Amazing! It works! Thanks a lot! But it still confuses me. I find that during the initialization of this model, CPU usage increases from 0 to 330 GB. It takes almost all the available CPU memory. Is that reasonable? The 11B model is only 40+ GB. If I use offload_param and offload_optimizer, then the CPU memory is exhausted.

Hi, did you manage to run the whole process? For me, I was able to initialize the model, but as soon as it reaches training an OOM error appears. I'm training a 13B model on a machine with 8x48 GB GPUs, without any offloading.


stas00 commented Mar 11, 2023

Enable gradient checkpointing to free up a ton of GPU memory; see #2797 (comment).

In some cases this allows you to double or quadruple the batch size if you were already able to run a small batch size without OOM.
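For readers landing here, a minimal sketch of how gradient checkpointing is typically turned on with Hugging Face Transformers (whether run_s2s.py already exposes a flag for this, and the exact transformers version in use, are assumptions; recent versions also accept gradient_checkpointing=True in TrainingArguments):

```python
# Enable gradient (activation) checkpointing on the model before handing it to the trainer.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/t5-xxl-lm-adapt")
model.gradient_checkpointing_enable()  # recompute activations in backward to save GPU memory
model.config.use_cache = False         # the generation cache is incompatible with checkpointing during training
```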


xinj7 commented Mar 11, 2023

> Enable gradient checkpointing to free up a ton of GPU memory; see #2797 (comment).
>
> In some cases this allows you to double or quadruple the batch size if you were already able to run a small batch size without OOM.

Thanks for the suggestion, but I'm already using gradient checkpointing in the trainer. I wonder if it's just impossible to train a 13B model in fp32 (the GPUs don't support bf16) on a machine with 8x48 GB GPUs without offloading (I have 150 GB of available CPU RAM, so no offloading).

Setting:
[screenshot: training settings]

OOM error:
[screenshot: OOM error]


stas00 commented Mar 11, 2023

Your problem is different from this issue and should be dealt with separately.

Could you please open a new issue where you specify all the details of your setup? For example, you're missing your ds config file, and you're not showing your command line or argument values, so it's very difficult for me to see the full picture.

There are other solutions for saving GPU memory, e.g. using BNB's 8-bit optimizer (huggingface/transformers#15622), though I haven't tried it with DeepSpeed. But let's discuss that there.

Please tag me on it.

@zhilizju

> @zhilizju, can you try disabling offloading by removing offload_param and offload_optimizer from your ds_config? It seems that with ZeRO stage 3 you should have enough GPU memory (across the 8 GPUs) to at least initialize the model. Please share a stack trace of this run. Thanks!
>
> Amazing! It works! Thanks a lot! But it still confuses me. I find that during the initialization of this model, CPU usage increases from 0 to 330 GB. It takes almost all the available CPU memory. Is that reasonable? The 11B model is only 40+ GB. If I use offload_param and offload_optimizer, then the CPU memory is exhausted.
>
> Hi, did you manage to run the whole process? For me, I was able to initialize the model, but as soon as it reaches training an OOM error appears. I'm training a 13B model on a machine with 8x48 GB GPUs, without any offloading.

I use bf16 rather than fp32. My batch size on each GPU is 1, and increasing it leads to OOM. (And I suspect that a model in fp32 may be difficult to train.) But I can increase --gradient_accumulation_steps to 8, so the total batch size can be 64. Hope this information helps you; stas00 also gives some good suggestions worth trying. Best wishes!
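For concreteness, the effective batch size arithmetic behind the previous comment (per-device batch size and GPU count taken from this thread):

```python
# Global batch size = per-device batch size x number of GPUs x gradient accumulation steps.
per_device_train_batch_size = 1
num_gpus = 8
gradient_accumulation_steps = 8

effective_batch_size = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
print(effective_batch_size)  # 64
```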

@tjruwase

@zhilizju, is it okay to close this issue since the original problem is resolved?
