[BUG] Deepspeed Crashes when using MoE, Stage 2 Offload with DeepSpeedCPUAdam #5203
Comments
Hi @KyleMylonakisProtopia
That PR seems to resolve the issue. Thanks for looking at it!
@tjruwase, let's please close this and merge the PR :)
github-merge-queue bot pushed a commit that referenced this issue on Mar 4, 2024:
The MoE param gradient norms don't need to be averaged when they are created on CPU and only 1-DP training is used. However, I just moved the tensor back to GPU to compute the average when there is data parallelism on the MoE parameters and CPU offload is used. This PR addresses #5203 --------- Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
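For reference, a minimal sketch of the approach the commit message describes — moving the CPU-resident norm onto the accelerator before the collective — might look like the following. The function and argument names (`average_expert_grad_norms`, `expert_dp_group`, `device`) are illustrative assumptions, not DeepSpeed's actual internals.

```python
import torch
import torch.distributed as dist

def average_expert_grad_norms(norm_groups, expert_dp_group, device):
    """Sketch only: average expert gradient norms across the expert
    data-parallel group. Names and structure are assumptions, not the
    exact DeepSpeed code."""
    world_size = dist.get_world_size(group=expert_dp_group)
    for i, norm in enumerate(norm_groups):
        # With optimizer offload the norm may live on CPU; move the scaled
        # value to the accelerator so the NCCL all-reduce does not crash.
        scaled = torch.tensor(norm / world_size, dtype=torch.float, device=device)
        dist.all_reduce(scaled, group=expert_dp_group)
        norm_groups[i] = scaled.item()
    return norm_groups
```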
ShellyNR pushed a commit to ShellyNR/DeepSpeed that referenced this issue on Mar 11, 2024, and rraminen pushed a commit to ROCm/DeepSpeed that referenced this issue on May 9, 2024; both carried the same commit message as above (referencing deepspeedai#5203).
Describe the bug
When performing a training run with a model with Mixture of Experts (MoE) layers using stage 2 offload with the DeepSpeedCPUAdam optimizer, during the parameter update step the following runtime error is thrown.
When using ep_size=1 for the expert layers, the call to self._average_expert_grad_norms(norm_groups) is not necessary, and commenting it out resolves the issue. This is of course not a general solution for ep_size > 1; however, in my case it would be sufficient to continue my work.

To Reproduce
Steps to reproduce the behavior:

1. Configure a model that contains Mixture of Experts (MoE) layers.
2. Enable ZeRO stage 2 with optimizer offload to CPU.
3. Use the DeepSpeedCPUAdam optimizer for efficient CPU offload.
4. Start a training run; the crash occurs during the parameter update step.

A minimal configuration sketch is shown below.
Expected behavior
Model training should occur with no issues or errors thrown.
ds_report output
Screenshots
N/A
System info (please complete the following information):
Launcher context
PyTorch Lightning
Docker context
Bare metal.
Additional context
I have ep_size=1 for my mixture of expert layers, so this bug is totally avoidable by just not having the all-reduce step. A sketch of such a guard is shown below.
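For completeness, that workaround could be expressed as a small guard around the offending call; `expert_parallel_size` and `optimizer` are hypothetical names used purely for illustration.

```python
def maybe_average_expert_grad_norms(optimizer, norm_groups, expert_parallel_size):
    # Hypothetical guard: with ep_size == 1 each rank already holds the full
    # expert gradient norms, so the cross-rank average (which crashes when the
    # norms live on CPU) can be skipped entirely.
    if expert_parallel_size > 1:
        optimizer._average_expert_grad_norms(norm_groups)
    return norm_groups
```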