Unable to package nvidia-docker tool for portable distribution #27999
Comments
Trying out your scripts... How do you generate the ld.so.cache manually?
Would love to get this working. I couldn't make any progress beyond yours, but tried to do so here: cyounkins@3e17639
I also have some stuff lying around here: averelld/nixpkgs@5361ba7

Status: Unpolished, also doesn't work, because none of the nvidia/cuda libraries are found. But you can launch containers (maybe requires that you have ldconfig installed), the hook is properly called and it finds the hardware. I think it would be better as a NixOS module, because the nvidia libs will have to match the system versions, and that way it can also use the system docker runc.

ldconfig is pretty baked into the design of libnvidia-container; it manually parses those cache files and makes the libraries available to the container. I think what might work is patching the ld.cache file location to e.g. [...]. If debug is enabled, you can see some helpful messages in [...].

Edit: I like how libelf makes all of that bmake stuff redundant
does #51733 work for you?
Thank you for your contributions. This has been automatically marked as stale because it has had no activity for 180 days. If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity. Here are suggestions that might help resolve this more quickly:
Issue description
I cannot build the nvidia-docker project without resorting to various hacks that are likely not portable or acceptable for other systems. Although I'm new to NixOS, I followed what seem to be standard packaging techniques for writing build expressions and still ran into the issues described below. I was hoping someone could suggest better solutions to these issues, which could then be used to develop a package suitable for wider distribution.
I tried both downloading/patching pre-built binaries, and building from source.
Using pre-built binaries
When working with the pre-built binaries I had the following problems with the build process:
The source code hard-codes the expected location of the ld.so.cache file as "/etc/ld.so.cache". NixOS in particular doesn't seem to make use of ld.so.cache, but it still has the ldconfig tool available to generate one (I don't fully understand this). By default the nvidia libraries are not included in the generated cache, but they can be added by providing an ld.so.conf file whose lines give the paths to their corresponding lib directories. However, when running ldconfig I get error messages like the following, one for each of the patched nvidia libraries:
ldconfig: file /usr/local/nvidia/lib/libcuda.so.375.66 is truncated
(Note that for reasons described below I copied these libraries from /nix/store to /usr/local)
The resulting ld.so.cache file does not include the nvidia libraries, which I believe were patched with the patchelf utility (version 0.9) when built with nix. This led me to manually create an ld.so.cache file for my machine that included these libraries (I wrote a small python script to generate it based on what I could find of the binary format, which I think essentially defines a mapping from library basenames to full paths along with various flags and metadata).
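For reference, a rough sketch of the equivalent using plain glibc ldconfig with a custom conf file and cache path (the paths are examples rather than the exact ones I used, and on my machine this route still fails with the truncation errors above, which is why I fell back to generating the cache myself):
cat > /tmp/nvidia-ld.so.conf << 'EOF'
/usr/local/nvidia/lib64
EOF
ldconfig -f /tmp/nvidia-ld.so.conf -C /tmp/ld.so.cache   # -f: alternate conf file, -C: write cache here instead of /etc/ld.so.cache
ldconfig -C /tmp/ld.so.cache -p                          # print the entries that ended up in the cache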
The architecture of the nvidia-docker tool requires that all the nvidia binaries and libraries reside on the same filesystem partition as the directory in which docker volumes are created, because hard links cannot be made across different filesystems. Since /nix/store is basically a separate read-only filesystem, one can't have nvidia-docker save volumes to it. On the other hand, the build script cannot access directories outside of /nix/store. This required me to copy the nvidia binaries (i.e. nvidia-smi) and libraries to /usr/local/nvidia and update my PATH to include /usr/local/nvidia/bin. Accordingly, all the nvidia library entries in the aforementioned ld.so.cache point to /usr/local/nvidia/lib64 in this case.
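For concreteness, the copy step looked roughly like the following (the store path and library glob are placeholders; the exact driver layout may differ on other systems):
NVIDIA_STORE=/nix/store/<hash>-nvidia-x11-375.66                   # placeholder, not the real store path
sudo mkdir -p /usr/local/nvidia/bin /usr/local/nvidia/lib64
sudo cp -L "$NVIDIA_STORE"/bin/nvidia-smi /usr/local/nvidia/bin/
sudo cp -L "$NVIDIA_STORE"/lib/lib*.so* /usr/local/nvidia/lib64/   # -L dereferences symlinks so real files are copied
export PATH=/usr/local/nvidia/bin:$PATH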
After these changes I was able to run nvidia-docker-plugin without errors, with volumes stored at /var/lib/nvidia-docker/volumes. Within the docker containers, however, ldconfig still complains with the same truncation errors (after an apt-get install, for example), but otherwise the GPUs are accessible and functional.
Building from source
When building from source I had the following problems:
Building from source allows some of the hard-coded variables, such as the location of the ld.so.cache file, to be changed; however, it otherwise has all the same problems as using the pre-built binaries.
I'm also not sure whether this type of build process can be considered supported by nix, since access to docker is something that requires additional privileges of some sort.
Steps to reproduce
I used the following build expression and build script for the pre-built binaries:
default.nix
bin_builder.sh
I put these two files in the same directory and ran:
nix-env --install --file default.nix
I used the following build expression and build script to build from source:
default.nix
builder.sh
In this case I added the nixbld1 user to the docker group with the following command:
sudo gpasswd --add nixbld1 docker
Then, similarly, I put these two files in the same directory and ran:
nix-env --install --file default.nix
After installation I also ran:
sudo gpasswd --delete nixbld1 docker
After the build process, in both cases I generated the ld.so.cache manually, in a likely unportable way (I can upload further details if requested). When using the pre-built binaries I placed the file at /etc/ld.so.cache, while when building from source I placed it at /usr/local/nvidia/etc/ld.so.cache.
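In other words, the last step amounted to something like this (the cache file name is a placeholder for whatever my script produced):
sudo cp nvidia-ld.so.cache /etc/ld.so.cache                            # pre-built binaries case
sudo install -D nvidia-ld.so.cache /usr/local/nvidia/etc/ld.so.cache   # built-from-source case; -D creates the parent directories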
For testing I ran the following commands in succession:
sudo nvidia-docker-plugin
which logs to stdout and stderr and manages the nvidia volumes for containers, and then
sudo nvidia-docker run --rm nvidia/cuda nvidia-smi
which should print out GPU info.
Technical details