ghc won't bootstrap on aarch64-linux anymore #97407
Comments
Bisected to patchelf update: f38ed04 |
Maybe we can try reverting b930b2d |
According to a quick test, that would only delay the segfault to later during bootstrap (hscolour, happy). |
That's quite sad, as one of the major reasons for the release was better aarch64 support. cc @delroth for ideas |
I'll look into it. For now can this be worked around by having ghc built with an older patchelf, or is this also breaking other binaries built with ghc? Would be helpful if you could attach an ELF that works pre-patchelf, the patchelf invocation, and the resulting segfaulting binary (I imagine building ghc takes forever, and I only have a "weak" ARMv8 builder machine.) |
Overriding patchelf just for the single build is not enough, unfortunately. |
@delroth building the binary package should be really fast as it only downloads the binary and patchelfs it so it can run with Nix. See https://nix-cache.s3.amazonaws.com/log/wm7l4xaq4jk8w5r9kscb4h6cibjqccjz-ghc-8.6.5-binary.drv |
Interestingly, the segfault doesn't reproduce on my machine with 64K page size. That's... very unexpected, I can't really think of how this would happen (the opposite is usually the problem, since alignment requirements are more restrictive). I'll spin up a VM on EC2 for repro I guess. |
EDIT: though I'm not sure whether that command really shows the number you need. |
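For anyone comparing machines in this thread: a minimal sketch of one way to read the page size the running kernel uses, assuming that is the number being asked about above (the original command is not shown, so this is not a reconstruction of it). The function name `page_size` is made up for illustration.

```python
import mmap
import os


def page_size() -> int:
    """Return the memory page size of the running system, in bytes."""
    try:
        # POSIX sysconf name; available on Linux, including aarch64.
        return os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # Fallback: the page size the Python mmap module was initialised with.
        return mmap.PAGESIZE


if __name__ == "__main__":
    size = page_size()
    print(f"page size: {size} bytes ({size // 1024}K)")
```

On the machines mentioned here this would be expected to report either 4K or 64K, depending on how the kernel was configured.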
Heads up this segfault is reproducible in binfmt running on x86_64, cross-compiling to aarch64 |
I'm trying to compile a proper |
I think it might need a patchelf override for all ghc-built packages (certainly for some others during ghc bootstrap). At least until the patchelf bug gets found and fixed. The problem is that I saw no way of doing such a wider override. |
What if we only need to set the correct |
The thing I don't get here is that the new patchelf behavior has no reason to produce binaries that can't run on 4K pages ARMv8 systems. Generating ELF files with 64K section alignment is what binutils does, what LLVM does, and how ARMv8 binaries are shipped in pretty much every single distro in the world. I'm still trying to find time to debug this. |
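The alignment argument above can be checked on a concrete binary. Below is a rough sketch, assuming a little-endian ELF64 file (which is what aarch64-linux binaries are), that lists the PT_LOAD segments and tests whether each one satisfies the usual ELF loading rule that p_vaddr and p_offset are congruent modulo a given value. The helper names `load_segments` and `congruent` are invented for this example and are not part of patchelf or any other tool.

```python
import struct

PT_LOAD = 1
# ELF64 file header and program header layouts (little-endian, no padding).
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
PHDR = struct.Struct("<IIQQQQQQ")


def load_segments(data: bytes):
    """Return (p_offset, p_vaddr, p_align) for every PT_LOAD segment."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles little-endian ELF64")
    fields = EHDR.unpack_from(data, 0)
    e_phoff, e_phentsize, e_phnum = fields[5], fields[9], fields[10]
    segments = []
    for i in range(e_phnum):
        (p_type, _flags, p_offset, p_vaddr,
         _paddr, _filesz, _memsz, p_align) = PHDR.unpack_from(
            data, e_phoff + i * e_phentsize)
        if p_type == PT_LOAD:
            segments.append((p_offset, p_vaddr, p_align))
    return segments


def congruent(segment, modulus: int) -> bool:
    """True if p_vaddr and p_offset agree modulo `modulus`."""
    p_offset, p_vaddr, _align = segment
    return (p_vaddr - p_offset) % modulus == 0


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        for seg in load_segments(f.read()):
            print(hex(seg[0]), hex(seg[1]), hex(seg[2]),
                  "ok@4K" if congruent(seg, 0x1000) else "BAD@4K",
                  "ok@64K" if congruent(seg, 0x10000) else "BAD@64K")
```

Any segment that is congruent modulo 0x10000 is automatically congruent modulo 0x1000, which is the reason 64K-aligned binaries are normally expected to load fine on 4K-page systems. A segment flagged BAD at 4K would point at a genuinely malformed file rather than at the alignment choice itself.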
I'm managing to reproduce the crash by taking the ghc binary pre-patchelf and running
[nix-shell:~]$ ldd /nix/store/413gmmlyhiik8ckxbhm8wvk0fqc3nclh-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc
Segmentation fault (core dumped)
But that's also reproducible on patchelf all the way to 0.9...
[nix-shell:~/patchelf]$ git checkout 0.9
Previous HEAD position was e1e39f3 Update release.nix
HEAD is now at 44b7f95 Update README
[nix-shell:~/patchelf]$ autoreconf -is && ./configure && make
...
make[1]: Leaving directory '/home/ubuntu/patchelf'
[nix-shell:~/patchelf]$ cp /nix/store/413gmmlyhiik8ckxbhm8wvk0fqc3nclh-ghc-8.6.5-binary/lib/ghc-8.6.5/bin/ghc.orig /tmp/ghc
[nix-shell:~/patchelf]$ src/patchelf --set-rpath /foo/bar/lol /tmp/ghc
[nix-shell:~/patchelf]$ src/patchelf --set-rpath /foo/bar/lol:/foo/bar/lol:/foo/bar/lol:/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol/foo/bar/lol:/foo/bar/lol /tmp/ghc
[nix-shell:~/patchelf]$ ldd /tmp/ghc
Segmentation fault (core dumped)
The symptoms are exactly the same though, so I suspect there's a latent bug that happened to surface with 0.12 for some reason. |
I've now tried running exactly the patchelf command that happens during the ghc build, with patchelf 0.9, 0.10, 0.11 and 0.12. In all 4 cases, the resulting binary segfaults with the exact same symptoms. I'm puzzled as to how this ever worked frankly. patchelf is broken, but this is not new brokenness. |
I finally managed to get a test case which passes on 0.11 and fails on 0.12:
Using this I bisected the patchelf change to 0470d6921b5a3fe8e92e356c8e11d120dbbb06c0 which is indeed the 4K->64K alignment change on ARMv8. I still suspect this is a completely unrelated patchelf issue that was only narrowly avoided by luck before, given that the GHC stage2 binary is itself aligned to 64K originally. Stripping was necessary to reproduce the failure in my test case, so maybe disabling stripping is all that's needed to luck into making this work again. I'm trying a build with dontStrip = true to see if that does indeed do something. |
There seems to be a latent bug with strip + patchelf on the GHC stage2 binary which got triggered again by patchelf 0.12. Until this is properly fixed in patchelf, re-disable stripping. Fixes NixOS#97407.
@vcunat you said that this workaround (disabling stripping) didn't work for you earlier in the bug. I'm kind of confused because it clearly does on my box. Could you reconfirm? |
I still get segfaults with that patch:
|
The same derivation builds fine here:
(On an EC2 Graviton2 instance, 4K page size.) |
diff --git a/pkgs/development/compilers/ghc/8.6.5-binary.nix b/pkgs/development/compilers/ghc/8.6.5-binary.nix
index 41af279e83f..2bed9f017d3 100644
--- a/pkgs/development/compilers/ghc/8.6.5-binary.nix
+++ b/pkgs/development/compilers/ghc/8.6.5-binary.nix
@@ -55,6 +55,8 @@ stdenv.mkDerivation rec {
   nativeBuildInputs = [ perl ];
   propagatedBuildInputs = stdenv.lib.optionals useLLVM [ llvmPackages.llvm ];
 
+  dontStrip = true;
+
   # Cannot patchelf beforehand due to relative RPATHs that anticipate
   # the final install location/
   ${libEnvVar} = libPath;
|
OK, the I suspect the machine maintainers might be very busy, so in the meantime let's try cachix (but note this installation hint). For now I uploaded the finished step:
as all followup ones fail very fast and incomplete builds are probably not as easy to share (though I can try if you want them). |
Hi @vcunat the only difference in ky2g.. seems to be the order things were added to the package.cache, which I hope doesn't matter. Looking at the HsColour, I think it builds twice as a ghc865Binary and is exactly the same except the second time can call itself to print out the warnings in color, etc.. From the derivation in your output, I think it was building the second which might mean the problem is that the first copy was stripped. Can you check that you have: And if so add rmw..hscolour to cachix? |
The rmw path is what fails to build ( |
Hi @vcunat I understand a bit more of the context now from those logs, yet I'm not really sure if the stage in faults at would make damage to a library not used in the earlier compiles a likely cause or not. I pushed a new branch with smaller rpaths and some printing of the elf section layout to hopefully arrive at a determination. I also added ghc8.10.2Binary in a way that is running on the rock64 (currently in stage1 build of ghc901). If you could try |
|
Hi @vcunat if you can add the 3 new logs (ghcbinary, happy, hscolour), I think there should be enough information for me to investigate what is unique or common at the points where it segfaults, by tracing good builds. On ghc901/rock64 it bootstraps fine; going on to xmonad fails since setlocale is whitelisting base <8.15. Whitelisting only up to the latest official release might be a common blocking point in libraries for 9.0.X? Yet, if this build doesn't fail on the community server, then I think using the 8.10.2Binary to bootstrap 8.10.2 (and earlier?) might work and/or it might add to the comparisons that can be made to figure out the cause of the segvs. Thanks! |
BTW, |
|
I've started a build of ghc8102 bumped to depend on ghc8102Binary. Since there is a patch, it is technically imperfect, but it will probably work. Would it make sense to raise the default to ghc 8.10.2 now anyway? Then it could make sense to build ghc 8.10.2 with itself on aarch64 and with 8.6.5Binary on the others. |
Hi @vcunat, 8.10.2Binary was able to bootstrap itself fine for me on aarch64 and x86_64, so I've put together an option for working around the problem by switching 8.10.2 to boot from its own binaries on all architectures. If it looks like a reasonable option to you, please try out 2764755 for ghc and some ghc.* packages. I'm currently building with it up to xmonad-with-packages on the rock64. Thanks, |
We can't switch the default GHC by a major version since it's following the latest LTS stackage release. |
In the meantime, the machine finished |
I think the explanation for the different behavior on the rock64 (Cortex-A53) and the ~64 core community machine is <8.10.1 Out of Order Execution for the SEGV when using the compiler. (This seems consistent with the segfaults happening with high probability in the 1st build with parallelizable work and if not in the second such round.) We could do various things to try to get 8.6/8.8 into the cache and I think they would work fine on small/old machines (though these should be becoming rare, i.e. the next in the rock64 line was the rock64Pro which has two A72 cores that could not be used.) |
Hmm, now that I said the rock64 is probably not going to hit these, I finally got segfaults on the rock64, by building ghc865.happy or ghc865.hscolour with /nix/store/349hpr41jk4s2g1naw8mpbdsdhkd47z8-ghc-8.6.5/bin/ghc from the cache (not ghc865Binary). Some logs look exactly like the ones on community, but in the core I could get the timing seems to be different:
|
|
ghc:8.10.2Binary bootstrap for 8.8 on aarch64 (#97407)
(cherry picked from commit 1c2ee21)
Can we close this issue? |
GHC bootstrap on aarch64 is broken. I can't see any relevant change (introduced somewhere within #97146), and I can reproduce the segfault on the shared box. Note that the newly branched-off 20.09 is also affected.
I'm a bit sorry about letting such a big regression (in terms of package count) into master, but we also wanted to quickly fix the nixos-unstable channel and unblock the 20.09 process.
Maintainers: @MarcWeber, @kosmikus, @peti