[BUG] DeepSpeed ZeRO-3 taking too much memory for FLAN-T5-XL (3B) #2797
Comments
cc: @tjruwase I was able to reproduce this large memory usage on a single A100. The GPU memory gets cleared to almost 0 before each iteration and then peaks to about 20GB during each iteration. I suspect that a lot of this memory is normal allocations, but it very likely shouldn't need 20GB if offload is working efficiently. If I turn both offloads off, the memory usage goes up to 70GB, so we definitely know offload works - just perhaps not as efficiently as we would like it to.
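For reference, the per-iteration peak described above can be measured with torch's CUDA memory counters. A minimal sketch; `training_step` and `batch` are placeholders for whatever your training loop actually does:

```python
import torch

def run_step_with_peak_report(step_idx, training_step, batch):
    """Run one training step and report the peak GPU memory it allocated."""
    torch.cuda.reset_peak_memory_stats()   # clear the running peak counter
    training_step(batch)                   # placeholder for forward/backward/optimizer step
    peak_gb = torch.cuda.max_memory_allocated() / 2**30
    print(f"step {step_idx}: peak GPU memory {peak_gb:.1f} GiB")
```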
Hi Philipp, Tunji and I spent some time researching this issue - we plugged in a bunch of memory-usage probes, and the offload works correctly; there are no bugs there. The extra memory is activations, which the estimator doesn't cover - we were discussing creating a calculator to estimate activation memory usage to make situations like this easier to diagnose. Now, fear not, I have a solution for your situation that comes from HF Transformers itself: you just need to change your trainer to enable gradient checkpointing.
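If I read the suggestion right, on the HF side this is just the Trainer's own gradient checkpointing flag. A minimal sketch, assuming `ds_config_zero3.json` is the (hypothetical filename of the) ZeRO-3 + offload config you are already using:

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,        # HF's activation (a.k.a. gradient) checkpointing
    deepspeed="ds_config_zero3.json",   # your existing DeepSpeed config, unchanged
)

# Alternatively, enable it on the model directly before building the Trainer:
# model.gradient_checkpointing_enable()
```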
Can confirm - with gradient checkpointing enabled it works.
Hello! I tried to use the activation checkpointing provided by DeepSpeed but the OOM still occurred, while gradient checkpointing in the Hugging Face trainer did solve the OOM problem. Isn't activation checkpointing in DeepSpeed the same as gradient checkpointing in the Hugging Face trainer? I'm a bit confused now.
Activation checkpointing and gradient checkpointing are two terms for the same methodology, except that HF Transformers models don't know anything about DeepSpeed's activation checkpointing. So if you want to use HF Transformers models, you enable HF's own gradient checkpointing (e.g. via the Trainer); if you write your own model and you want to use DeepSpeed's activation checkpointing, you use the API it prescribes. I hope I was able to help with clarity here, @nonstopfor. And I totally agree that it's odd that the same feature acquired two different names - it took me a while to get used to that.
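To make the two paths concrete, here is a sketch of both options; the `BlockStack` module is a hypothetical stand-in for your own model's layer stack, not anything from the thread:

```python
import torch
import deepspeed

# (a) HF Transformers model: use HF's own implementation
#     model.gradient_checkpointing_enable()  # or gradient_checkpointing=True in TrainingArguments

# (b) your own model: wrap the expensive blocks with DeepSpeed's checkpoint API
class BlockStack(torch.nn.Module):
    def __init__(self, blocks):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)

    def forward(self, hidden_states):
        for block in self.blocks:
            # activations are recomputed (or CPU-offloaded, per config) during backward
            hidden_states = deepspeed.checkpointing.checkpoint(block, hidden_states)
        return hidden_states
```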
Thanks very much! It works now. I was thinking deepspeed.initialize() might activate gradient checkpointing automatically if activation checkpointing is set in the DeepSpeed config file.
Glad to hear you have it working, @nonstopfor. As I explained above, you have to recode the modeling code to replace torch's checkpointing with DeepSpeed's version of it. Their checkpointing is more flexible, since it can offload activations to CPU instead of recalculating them, so if you want to squeeze out more performance you could copy HF's model code, replace our checkpointing with DeepSpeed's, and configure it to use their advanced features. But chances are that what HF provides out of the box is good enough performance-wise.
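For completeness, the "offload to CPU instead of recalculating" behaviour mentioned above is driven by the `activation_checkpointing` section of the DeepSpeed config. A sketch of that fragment as a Python dict; the values are illustrative, and it only matters if your model actually calls `deepspeed.checkpointing.checkpoint()`:

```python
activation_checkpointing_cfg = {
    "activation_checkpointing": {
        "partition_activations": True,   # shard checkpointed activations across GPUs
        "cpu_checkpointing": True,       # offload checkpointed activations to CPU
        "contiguous_memory_optimization": False,
        "synchronize_checkpoint_boundary": False,
        "profile": False,
    }
}
# Merge this into your existing ZeRO-3 config; it has no effect on HF's built-in
# gradient checkpointing, only on modules using deepspeed.checkpointing.
```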
Describe the bug
I am trying to train FLAN-T5-XL using DeepSpeed ZeRO-3 and transformers, and ZeRO-3 / CPU offload seems to use quite a lot of GPU memory compared to expectations. I am running on 4x V100 16GB. I ran the estimate_zero3_model_states_mem_needs_all_cold estimator, which gave me the following results. I know that covers only weights + grads + optimizer states, but my assumption would be that with offloading it should fit on 4x V100 16GB with a batch size of 1.
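The estimator can be run standalone; a sketch using the live-model variant (the `..._all_cold` variant referenced above takes raw parameter counts instead of an instantiated model):

```python
from transformers import AutoModelForSeq2SeqLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
# Prints the per-GPU/CPU memory needed for weights, grads and optimizer states
# under ZeRO-3 with and without offload. Activation memory is NOT included.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
```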
To Reproduce
Steps to reproduce the behavior:
1. Clone the script and create the directory
2. Create the dataset with the following script
Expected behavior
My assumption is that training with ZeRO-3 and offload should work on 4x V100.
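For context, a minimal sketch of the kind of ZeRO-3 + CPU-offload configuration this assumes; the exact config used in the report isn't shown, and the "auto" fields are filled in by the HF Trainer integration:

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
# Can be passed directly as TrainingArguments(deepspeed=ds_config, ...),
# together with gradient_checkpointing=True as discussed in the comments above.
```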
System info (please complete the following information):
transformers version: 4.26.0