2.21.5 hangs, after using P2P disable, SHM disable, memory allocations. Reproducing Error 2 unhandled and Error 3 queue developers #1608

Open
HyouinSchoolAcc opened this issue Feb 16, 2025 · 1 comment


@HyouinSchoolAcc

Settings that lead to error 3:
(myenv) exx@sn4622123662:~/Desktop/fine-tune$ printenv | grep ^NCCL
NCCL_LEGACY_CUDA_REGISTER=0
NCCL_P2P_DISABLE=1
NCCL_SOCKET_IFNAME=eth0
NCCL_P2P_LEVEL=NVL
NCCL_DEBUG=INFO
NCCL_SET_STACK_SIZE=1
NCCL_CUMEM_ENABLE=0

Removing all of these settings leads to error 2. Disabling SHM or disabling P2P, each on its own, also reproduces error 2.
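For anyone trying to reproduce the "no settings" case, a minimal sketch of clearing every NCCL variable from the current shell before rerunning the test. This assumes a bash-like shell and that all relevant variables start with `NCCL_`; it is not taken from the original report.

```shell
# List every exported variable whose name starts with NCCL_ and unset it.
# Assumption: variable values contain no embedded newlines.
for v in $(printenv | grep '^NCCL_' | cut -d= -f1); do
  unset "$v"
done

# Show what is left; prints nothing if the environment is clean.
printenv | grep '^NCCL_' || true
```

Running `printenv | grep ^NCCL` afterwards should print nothing, which confirms the run really uses NCCL defaults.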

GPUs run well on their own

NCCL test without MPI fails and hangs.
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2

@kiskra-nvidia
Member

Sorry for the delay in responding.

What is the node configuration?

We would like to see the output generated with NCCL_DEBUG=INFO. You might want to add NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,TUNING,NET to the mix for some additional info that might help.
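A sketch of how that output could be captured, assuming a bash-like shell and the same `all_reduce_perf` command as in the report. The `NCCL_DEBUG_FILE` setting, the `/tmp` path, and the 300-second `timeout` are assumptions added here so that a hanging run still ends and leaves a log behind; adjust or drop them as needed.

```shell
# Verbose NCCL logging for the subsystems requested above.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,TUNING,NET

# Assumption: send the debug output to one file per process
# (%h expands to the hostname, %p to the process ID).
export NCCL_DEBUG_FILE=/tmp/nccl-debug.%h.%p.log

# Same test command as in the report, run only if the binary exists here.
# timeout stops the run after 300 seconds so a hang does not block the shell.
if [ -x ./build/all_reduce_perf ]; then
  timeout 300 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
fi
```

The resulting `/tmp/nccl-debug.*.log` files can then be attached to the issue.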

Is this reproducible with the latest NCCL version (2.25.1)?
