Two-threaded all-to-all communication timeout error #1610
Comments
Yeah, the error message is far from perfect, but I suspect that it's not the root cause. The root cause appears to be a collective operation timeout, which could be due to any number of reasons. What do you mean by "when using multiple threads"? Are those threads using the same communicator?
Thank you for your reply. My situation is that I used two threads to execute two AlltoAll communications concurrently. Both AlltoAll calls used the same communicator, which caused a timeout. If the two threads use different communicators, could a timeout still occur?
If you have two threads issuing collective operations at the same time on the same communicator, there's a good chance that this is the root cause of the hangs. See https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/threadsafety.html .
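As a rough illustration of that advice (a minimal sketch, assuming a PyTorch/Megatron-style job where torch.distributed is already initialized with the NCCL backend; the function name and tensor sizes are invented for illustration), each thread can be given its own process group so that each all-to-all runs on its own NCCL communicator:

```python
import threading

import torch
import torch.distributed as dist

# Assumption: dist.init_process_group("nccl") has already been called on every
# rank (e.g. by torchrun/Megatron), and each rank has selected its CUDA device.
world = list(range(dist.get_world_size()))

# Each new_group() call is backed by its own NCCL communicator, so the two
# threads below never issue collectives concurrently on the same communicator.
pg_a = dist.new_group(ranks=world)
pg_b = dist.new_group(ranks=world)


def run_all_to_all(group):
    n = dist.get_world_size()
    inp = torch.randn(n * 4, device="cuda")  # illustrative size, divisible by n
    out = torch.empty_like(inp)
    dist.all_to_all_single(out, inp, group=group)


t1 = threading.Thread(target=run_all_to_all, args=(pg_a,))
t2 = threading.Thread(target=run_all_to_all, args=(pg_b,))
t1.start(); t2.start()
t1.join(); t2.join()
```

If the two operations must share one communicator, the alternative is to serialize them (for example with a threading.Lock around each collective call) so that only one thread issues work on that communicator at a time.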
Hello, when I was training an LLM MoE model and used all-to-all communication, I encountered the error "proxy.cc:1458 NCCL WARN [Service thread] Accept failed Success" when using multiple threads.
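For reference, a minimal sketch of the pattern described above (not the author's actual Megatron code; names and sizes are invented): two Python threads issuing all_to_all_single on the same default process group, i.e. on the same NCCL communicator, which is the usage the thread-safety discussion above flags as unsafe.

```python
import threading

import torch
import torch.distributed as dist

# Assumption: dist.init_process_group("nccl") was already called on every rank.
def all_to_all_on_default_group():
    n = dist.get_world_size()
    inp = torch.randn(n * 4, device="cuda")
    out = torch.empty_like(inp)
    # Both threads use the default group, i.e. the same NCCL communicator.
    dist.all_to_all_single(out, inp)

threads = [threading.Thread(target=all_to_all_on_default_group) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```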
Environment:
NCCL Version: 2.22.3-1+cuda12.6
Megatron LLM Version: core_r0.10.0
CUDA Version: 12.6
Operating System: Debian 12
Hardware: NVIDIA A100, two nodes (16 GPUs)
Here is the bug