Unable to package nvidia-docker tool for portable distribution #27999
Comments
Trying out your scripts... How do you generate the ld.so.cache manually?
Would love to get this working. I couldn't make any progress beyond yours, but tried to do so here: cyounkins@3e17639
I also have some stuff lying around here: averelld/nixpkgs@5361ba7

Status: Unpolished, also doesn't work, because none of the nvidia/cuda libraries are found. But you can launch containers (maybe requires that you have ldconfig installed), the hook is properly called and it finds the hardware. I think it would be better as a NixOS module, because the nvidia libs will have to match the system versions, and that way it can also use the system docker runc.

ldconfig is pretty baked into the design of libnvidia-container; it manually parses those cache files and makes the libraries available to the container. I think what might work is patching the ld.cache file location to e.g. [...]. If debug is enabled, you can see some helpful messages in [...].

Edit: I like how libelf makes all of that bmake stuff redundant
does #51733 work for you?
Thank you for your contributions. This has been automatically marked as stale because it has had no activity for 180 days. If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity. Here are suggestions that might help resolve this more quickly:
Issue description
I cannot build the nvidia-docker project without resorting to various hacks that are likely not portable or acceptable for other systems. Although I'm new to NixOS, I followed what seem to be standard packaging techniques for writing build expressions and still ran into the issues described below. I was hoping someone could suggest better solutions to these issues, which could then be used to develop a package suitable for wider distribution.
I tried both downloading/patching pre-built binaries, and building from source.
Using pre-built binaries
When working with the pre-built binaries I had the following problems with the build process:
The source code hard-codes the expected location of the ld.so.cache file as "/etc/ld.so.cache". NixOS in particular doesn't seem to make use of ld.so.cache, but it still has the ldconfig tool available to generate one (I don't fully understand this). By default the nvidia libraries are not included in the generated cache, but they can be added by providing an ld.so.conf file whose lines give the paths to their corresponding lib directories. However, when running ldconfig I get error messages like the following, one for each of the patched nvidia libraries:
ldconfig: file /usr/local/nvidia/lib/libcuda.so.375.66 is truncated
(Note that for reasons described below I copied these libraries from /nix/store to /usr/local)
The resulting ld.so.cache file does not include the nvidia libraries, which I believe were patched with the patchelf utility (version 0.9) when built with nix. This led me to manually create an ld.so.cache file for my machine that included these libraries (I wrote a small python script to generate it based on what I could find of the binary format, which I think essentially defines a mapping from library basenames to full paths along with various flags and metadata).
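For reference, a rough sketch of the equivalent using plain glibc ldconfig with a custom conf file and cache path (the paths are examples rather than the exact ones I used, and on my machine this route still fails with the truncation errors above, which is why I fell back to generating the cache myself):
cat > /tmp/nvidia-ld.so.conf << 'EOF'
/usr/local/nvidia/lib64
EOF
ldconfig -f /tmp/nvidia-ld.so.conf -C /tmp/ld.so.cache   # -f: alternate conf file, -C: write cache here instead of /etc/ld.so.cache
ldconfig -C /tmp/ld.so.cache -p                          # print the entries that ended up in the cache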
The architecture of the nvidia-docker tool requires that all the nvidia binaries and libraries reside on the same filesystem partition as the directory in which docker volumes are created, because hard links cannot be made across different filesystems. Since /nix/store is basically a separate read-only filesystem, one can't have nvidia-docker save volumes to it. On the other hand, the build script cannot access directories outside of /nix/store. This required me to copy the nvidia binaries (i.e. nvidia-smi) and libraries to /usr/local/nvidia and update my PATH to include /usr/local/nvidia/bin. Accordingly, all the nvidia library entries in the aforementioned ld.so.cache point to /usr/local/nvidia/lib64 in this case.
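For concreteness, the copy step looked roughly like the following (the store path and library glob are placeholders; the exact driver layout may differ on other systems):
NVIDIA_STORE=/nix/store/<hash>-nvidia-x11-375.66                   # placeholder, not the real store path
sudo mkdir -p /usr/local/nvidia/bin /usr/local/nvidia/lib64
sudo cp -L "$NVIDIA_STORE"/bin/nvidia-smi /usr/local/nvidia/bin/
sudo cp -L "$NVIDIA_STORE"/lib/lib*.so* /usr/local/nvidia/lib64/   # -L dereferences symlinks so real files are copied
export PATH=/usr/local/nvidia/bin:$PATH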
After these changes I was able to run nvidia-docker-plugin without errors, with volumes stored at /var/lib/nvidia-docker/volumes. Within the docker containers, however, ldconfig still complains with the same truncation errors (after an apt-get install, for example), but otherwise the GPUs are accessible and functional.
Building from source
When building from source I had the following problems:
Building from source allows some of the hard-coded variables, such as the location of the ld.so.cache file, to be changed; however, it otherwise has all the same problems as using the pre-built binaries.
I'm also not sure whether this type of build process can be considered supported by nix, since access to docker is something that requires additional privileges of some sort.
Steps to reproduce
I used the following build expression and build script for the pre-built binaries:
default.nix
bin_builder.sh
I put these two files in the same directory and ran:
nix-env --install --file default.nix
I used the following build expression and build script to build from source:
default.nix
builder.sh
In this case I added the nixbld1 user to the docker group with the following command:
sudo gpasswd --add nixbld1 docker
Then, similarly, I put these two files in the same directory and ran:
nix-env --install --file default.nix
After installation I also ran:
sudo gpasswd --delete nixbld1 docker
After the build process, in both cases I generated the ld.so.cache manually, in a likely unportable way (I can upload further details if requested). When using the pre-built binaries I placed the file at /etc/ld.so.cache, while when building from source I placed it at /usr/local/nvidia/etc/ld.so.cache.
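In other words, the last step amounted to something like this (the cache file name is a placeholder for whatever my script produced):
sudo cp nvidia-ld.so.cache /etc/ld.so.cache                            # pre-built binaries case
sudo install -D nvidia-ld.so.cache /usr/local/nvidia/etc/ld.so.cache   # built-from-source case; -D creates the parent directories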
For testing I ran the following commands in succession:
sudo nvidia-docker-plugin
which logs to stdout and stderr and manages the nvidia volumes for containers, and then
sudo nvidia-docker run --rm nvidia/cuda nvidia-smi
which should print out GPU info.
Technical details