CalledProcessError: Command '['hostname -I']' died with <Signals.SIGSEGV: 11>. #2837

Closed
saforem2 opened this issue Feb 15, 2023 · 5 comments
@saforem2
Collaborator

I'm not sure of the cause, but when running multi-node training (launched with mpich), I get the following error:

  File "/lus/grand/projects/datascience/foremans/locations/polaris/projects/saforem2/Megatron-DeepSpeed/dist.py", line 106, in init_deepspeed
    deepspeed.init_distributed()
  File "/lus/grand/projects/datascience/foremans/locations/polaris/miniconda3/envs/2022-09-08-hvd-nccl/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 646, in init_distributed
    mpi_discovery(distributed_port=distributed_port, verbose=verbose)
  File "/lus/grand/projects/datascience/foremans/locations/polaris/miniconda3/envs/2022-09-08-hvd-nccl/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 674, in mpi_discovery
    result = subprocess.check_output(hostname_cmd, shell=True)
  File "/lus/grand/projects/datascience/foremans/locations/polaris/miniconda3/envs/2022-09-08-hvd-nccl/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/lus/grand/projects/datascience/foremans/locations/polaris/miniconda3/envs/2022-09-08-hvd-nccl/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['hostname -I']' died with <Signals.SIGSEGV: 11>.

The error is originating from deepspeed/comm/comm.py:

https://github.com/microsoft/DeepSpeed/blob/46784cb58edf7bbe9b6bbec95212de7b81e55b01/deepspeed/comm/comm.py#L676

An easy fix would be replacing

hostname_cmd = ["hostname -I"]
result = subprocess.check_output(hostname_cmd, shell=True)
master_addr = result.decode('utf-8').split()[0]

with

import socket
master_addr = socket.gethostbyaddr(socket.gethostname())[0]
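
A more conservative variant of this idea is to keep the hostname call and only fall back to a socket lookup when the command fails. The sketch below is illustrative, not DeepSpeed's code: discover_master_addr and its fallback parameter are hypothetical names, the command is passed as an argument list without shell=True, and the default fallback uses socket.gethostbyname (which returns an IPv4 address string) rather than the gethostbyaddr call above (which returns a host name).

```python
import socket
import subprocess


def discover_master_addr(hostname_cmd=("hostname", "-I"), fallback=None):
    """Return the first address printed by hostname_cmd, else the fallback's result.

    Hypothetical helper for illustration; not part of DeepSpeed.
    """
    if fallback is None:
        # Assumed fallback: resolve this machine's hostname to an IPv4 address.
        def fallback():
            return socket.gethostbyname(socket.gethostname())

    try:
        result = subprocess.check_output(list(hostname_cmd))
        addresses = result.decode("utf-8").split()
        if addresses:
            return addresses[0]
    except (subprocess.CalledProcessError, OSError):
        # CalledProcessError covers non-zero exits and deaths by signal
        # (such as SIGSEGV); OSError covers a missing executable.
        pass
    return fallback()
```

Calling master_addr = discover_master_addr() would then behave like the current code whenever hostname -I succeeds, and only change behavior on nodes where it crashes.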
@loadams
Collaborator

loadams commented Aug 21, 2023

Hi @saforem2 - sorry for the late reply. It looks like the hostname -I command that DeepSpeed tries to invoke is failing on your system, possibly because of permissions. Do you know what is causing that?

Also would you mind making a PR with the suggested change?

@loadams loadams self-assigned this Aug 21, 2023
@saforem2
Collaborator Author

Yeah, no worries. Honestly this was actually a (seemingly?) intermittent issue (that I haven't seen in a while, come to think of it) so I never bothered to pin it down.

I guess they both achieve the same thing, though maybe the method using socket is simpler?

but yeah, happy to submit a PR if you think this would be preferred

@loadams
Collaborator

loadams commented Aug 21, 2023

Actually, I'm seeing different results on my machine with the two approaches: hostname -I is returning the IPv6 address, while the socket method seems to be returning my machine name. To avoid breaking other things, perhaps we leave it as it is, since it's an issue you haven't seen in a while?
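
For anyone wanting to check the difference on their own machine, here is a minimal sketch. compare_lookups is a hypothetical helper name; it only uses standard-library socket calls, and gethostbyname is IPv4-only, so it will not reproduce the IPv6 address that hostname -I printed here.

```python
import socket


def compare_lookups(host=None):
    """Return what each socket lookup yields for host (default: this machine)."""
    if host is None:
        host = socket.gethostname()
    results = {}
    try:
        # First element of the (hostname, aliaslist, ipaddrlist) tuple: a name.
        results["gethostbyaddr"] = socket.gethostbyaddr(host)[0]
    except OSError:
        results["gethostbyaddr"] = None
    try:
        # An IPv4 address string such as "10.0.0.5".
        results["gethostbyname"] = socket.gethostbyname(host)
    except OSError:
        results["gethostbyname"] = None
    return results


if __name__ == "__main__":
    print(compare_lookups())
```

A None entry means that lookup failed for the given host.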

@saforem2
Collaborator Author

yeah sounds good, happy to close this then

@loadams
Collaborator

loadams commented Aug 21, 2023

Thanks for reporting the bug, and hopefully whatever was causing it remains fixed!
