
[20.03] nvidia-docker: failure to find libnvidia-ml.so on LD_LIBRARY_PATH #83713

Closed
mjlbach opened this issue Mar 29, 2020 · 18 comments · Fixed by #86096
Labels
0.kind: bug Something is broken

Comments

@mjlbach
Contributor

mjlbach commented Mar 29, 2020

Describe the bug
On 19.09 the OpenGL/graphics libraries populate LD_LIBRARY_PATH. When setting the appropriate configuration to replicate this behavior in 20.03, nvidia-docker fails to initialize CUDA.

Steps to replicate

~/Repositories
❯ echo $LD_LIBRARY_PATH 
/run/opengl-driver/lib:/run/opengl-driver-32/lib   

~/Repositories 
❯ ls /run/opengl-driver/lib/libnvidia-ml.so 
/run/opengl-driver/lib/libnvidia-ml.so   

~/Repositories
❯ docker run --gpus all nvidia/cuda:10.0-base nvidia-smi 
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that 
the NVIDIA Display Driver is properly installed and present in your system. Please also
try adding directory that contains libnvidia-ml.so to your system PATH.

Pertinent section of configuration.nix

  services.xserver.videoDrivers = [ "nvidia" ];
  # services.xserver.startDbusSession = false;
  # services.dbus.socketActivated = true;

  # Enable OpenGL
  hardware.opengl = {
    enable = true;
    driSupport32Bit = true;
    setLdLibraryPath = true;
  };

  # Enable Docker
  virtualisation.docker = {
    enable = true;
    enableNvidia = true;
  };

mjlbach added the 0.kind: bug Something is broken label Mar 29, 2020
@worldofpeace
Contributor

This was merged in 19.09 (370d3af) but reverted because applications had issues. We fixed those during 20.03 development, so it is still disabled.

@worldofpeace
Contributor

I'm guessing you also tried adding it to PATH? (which is really weird considering it's a lib)

mjlbach changed the title from "LD_LIBRARY_PATH empty on nixos 20-03" to "[20.03] nvidia-docker: failure to find libnvidia-ml.so on LD_LIBRARY_PATH" Mar 29, 2020
worldofpeace reopened this Mar 29, 2020
@biggs
Contributor

biggs commented Apr 22, 2020

Have you found any temporary work-around for this? I've had to revert to 19.09, which is unfortunate.

@mjlbach
Contributor Author

mjlbach commented Apr 22, 2020

Not yet. I've tried directly starting the daemon with LD_LIBRARY_PATH and LD_PRELOAD pointing to libnvidia-ml.so. I'm wondering if this would be fixed by bumping the version of nvidia-container-toolkit/runc; I haven't had time yet to try updating/patching the packages.
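For anyone who wants to try the same thing declaratively, a rough, untested sketch of injecting the host driver path into the Docker daemon's environment through its systemd unit would look something like this (the path is the one from the report above; this is an illustration of the approach, not a confirmed fix):

  # Untested sketch: expose the host's NVIDIA userspace libraries to the
  # Docker daemon so processes it starts can find libnvidia-ml.so.
  systemd.services.docker.environment = {
    LD_LIBRARY_PATH = "/run/opengl-driver/lib";
  };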

@tbenst
Contributor

tbenst commented Apr 24, 2020

I could reproduce this issue on my machine on 20.03, as could @tomberek.

@mjlbach
Contributor Author

mjlbach commented Apr 24, 2020

It looks like nvidia-docker is still using nvidia-docker2 as opposed to the newer nvidia-container-toolkit, which may warrant upgrading the Nix nvidia-docker ecosystem: the newer version seems simpler (it doesn't involve replacing Docker's runc), and nvidia-container-toolkit is also compatible with podman (a rootless Docker alternative by Red Hat).

@tbenst
Contributor

tbenst commented Apr 24, 2020

Well, this is odd. The first error makes sense, as there is indeed a version mismatch:

> nvidia-docker run --gpus all -v '/nix/store/zvyjqq5170wzhr3fwbzbj9py31qpyvby-nvidia-x11-440.59-5.4.24/lib:/lib/cuda:ro' --env LD_LIBRARY_PATH='/lib/cuda' nvidia/cuda:10.1-base nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
> docker run --gpus all -v '/nix/store/xr1gd704d9dbfa5qw5dv8gc3l3g9dka6-nvidia-x11-440.82-5.4.32:/lib/cuda:ro' --env LD_LIBRARY_PATH='/lib/cuda' nvidia/cuda:10.1-base nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

@tomberek
Contributor

Adding some symlinks manually works, but it is not pretty.

docker run --gpus all --rm -it nvidia/cuda:10.2-base bash -c 'ln -sf /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.* /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1; ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so.* /usr/lib/x86_64-linux-gnu/libcuda.so.1; nvidia-smi'

Another approach is to override LD_LIBRARY_PATH and mount in the host's drivers:

docker run --gpus all -v '/run/opengl-driver/lib:/lib/cuda:ro' --env LD_LIBRARY_PATH=/lib/cuda nvidia/cuda:10.2-base nvidia-smi

Seems like this has something to do with ldconfig not liking patchelf'd libraries:
NixOS/patchelf#44
#27999

It also seems others have addressed this problem before (#51733), but those fixes are no longer working.

@averelld
Contributor

Please try this libnvidia-container upgrade as a workaround: averelld@f295c70
I believe this is an upstream bug/incompatibility that also affected other distros and was fixed there in the meantime, but I can't find the relevant issue at the moment.
The real fix (moving to the new container toolkit) would be a bit more work, especially if we have to keep the actual "nvidia-docker" binary for backwards compatibility.
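One rough way to test that commit without switching your whole channel is an overlay that takes nvidia-docker from a checkout containing the patch. Sketch only: the tarball URL below assumes the patch lives on that fork/commit and that everything needed ends up in the nvidia-docker attribute, so adjust it to the actual fork/commit.

  # Sketch: pull nvidia-docker from a nixpkgs tree that includes the patched
  # libnvidia-container. Adjust the URL to the real fork/commit.
  nixpkgs.overlays = [
    (self: super:
      let
        patched = import (builtins.fetchTarball
          "https://github.com/averelld/nixpkgs/archive/f295c70.tar.gz") {
            config.allowUnfree = true;
          };
      in {
        nvidia-docker = patched.nvidia-docker;
      })
  ];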

tomberek pushed a commit to tomberek/nixpkgs that referenced this issue Apr 26, 2020
@tomberek
Contributor

@averelld: your patch works.

@mjlbach
Contributor Author

mjlbach commented Apr 27, 2020

@averelld Did you want to be in charge of the PR? Otherwise I can handle it. I still think there should be a discussion about whether we should bother keeping nvidia-docker if we move to nvidia-container-toolkit (I don't see the point personally), but this at least closes out the issue.

@averelld
Contributor

Nice. I'm also in favor of not keeping the legacy wrapper.
