Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RISC-V boot failure after LLVM commit e87f33d9ce785668223c3bcc4e06956985cccda1 #1965

Closed
nathanchance opened this issue Dec 11, 2023 · 15 comments
Labels
[ARCH] risc-v This bug impacts ARCH=riscv [BUG] llvm (main) A bug in an unreleased version of LLVM (this label is appropriate for regressions) [FIXED][LLVM] main This bug was only present and fixed in an unreleased version of LLVM

Comments

@nathanchance
Copy link
Member

nathanchance commented Dec 11, 2023

After llvm/llvm-project@e87f33d, I see a boot failure with ARCH=riscv defconfig:

$ make -sjk"$(nproc)" ARCH=riscv LLVM=1 mrproper defconfig Image

$ boot-qemu.py -k .
...
[    0.000000] Linux version 6.7.0-rc5 (nathan@dev-arch.thelio-3990X) (ClangBuiltLinux clang version 18.0.0git (https://github.com/llvm/llvm-project e87f33d9ce785668223c3bcc4e06956985cccda1), ClangBuiltLinux LLD 18.0.0) #1 SMP Mon Dec 11 10:23:26 MST 2023
...
[    0.000000] percpu: Embedded 20 pages/cpu s41592 r8192 d32136 u81920
[    0.000000] Unable to handle kernel paging request at virtual address ff6000001ffe9040
[    0.000000] Oops [#1]
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.7.0-rc5 #1
[    0.000000] Hardware name: riscv-virtio,qemu (DT)
[    0.000000] epc : 0xff6000001ffe9040
[    0.000000]  ra : 0xff6000001ffe9040
[    0.000000] epc : ff6000001ffe9040 ra : ff6000001ffe9040 sp : ffffffff81203e00
[    0.000000]  gp : ffffffff813102c0 tp : ffffffff8120d580 t0 : 0000000000000000
[    0.000000]  t1 : 0000000000000000 t2 : ffffffff8134a648 s0 : 0000000000000004
[    0.000000]  s1 : ffffffff80c21000 a0 : ff6000001ffeadc0 a1 : 000000000000a000
[    0.000000]  a2 : 0000000000000000 a3 : 000000000000a000 a4 : ff6000001ffeadcc
[    0.000000]  a5 : 0000000000000000 a6 : 0000000000000028 a7 : 0000000000000400
[    0.000000]  s2 : ff6000001ffe9000 s3 : 0000000000014000 s4 : 0000000000001000
[    0.000000]  s5 : 0000000000000018 s6 : 0000000000000040 s7 : 0000000000000001
[    0.000000]  s8 : 000000009ffe7fff s9 : ffffffff81310530 s10: ff6000001ffeaec0
[    0.000000]  s11: ff6000001ffe9050 t3 : ffffffff8134b398 t4 : ffffffff8134b398
[    0.000000]  t5 : ffffffff8134b2d8 t6 : ffffffff8134b398
[    0.000000] status: 0000000200000100 badaddr: ff6000001ffe9040 cause: 000000000000000c
[    0.000000] Code: 0000 0000 1000 0000 0000 0000 0001 0000 0000 0000 (0001) 0000
[    0.000000] ---[ end trace 0000000000000000 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
qemu-system-riscv64: terminating on signal 15 from pid 36702 (timeout)

That commit seems rather innocuous, so it seems likely that this has just exposed some other issue related to linker relaxation because if I apply the following diff that adds -mno-relax to KBUILD_CFLAGS and KBUILD_AFLAGS, the kernel boots fine.

diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index a74be78678eb..d66f09fee2b4 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -44,7 +44,6 @@ else
 endif

 ifeq ($(CONFIG_LD_IS_LLD),y)
-ifeq ($(call test-lt, $(CONFIG_LLD_VERSION), 150000),y)
        KBUILD_CFLAGS += -mno-relax
        KBUILD_AFLAGS += -mno-relax
 ifndef CONFIG_AS_IS_LLVM
@@ -52,7 +51,6 @@ ifndef CONFIG_AS_IS_LLVM
        KBUILD_AFLAGS += -Wa,-mno-relax
 endif
 endif
-endif

 ifeq ($(CONFIG_SHADOW_CALL_STACK),y)
        KBUILD_LDFLAGS += --no-relax-gp

cc @topperc @MaskRay just in case you have any immediate ideas of where I should look to see what is going on here.

Bisect log
# bad: [08cb64034f17d50a660ec78ce8ea81a025b0ba71] [NFC] Remove an unused decl to avoid warning
# good: [13b88265088329decd15449e3b2461a6177174b2] Revert " [OpenMP][NFC] Remove `DelayedBinDesc`" (#74679)
git bisect start '08cb64034f17d50a660ec78ce8ea81a025b0ba71' '13b88265088329decd15449e3b2461a6177174b2'
# bad: [f5724847ec6d7e157f711a590e73895e0f048fc4] [mlir][Transforms][NFC] GreedyPatternRewriteDriver: Remove redundant worklist management code (#74796)
git bisect bad f5724847ec6d7e157f711a590e73895e0f048fc4
# good: [04c4566ca19c054c26460a14270086f1fbaf9abd] [flang] Use `genOpenMPTerminator` to insert terminator (#74719)
git bisect good 04c4566ca19c054c26460a14270086f1fbaf9abd
# bad: [58785ebd24b82f7d1d5fa6a0f8bb2a15de130230] [SLP][NFC]Check for ephemeral values beforehand, NFC.
git bisect bad 58785ebd24b82f7d1d5fa6a0f8bb2a15de130230
# good: [0e6685ab1a8313cd1dc7eb3c99ff642e6c492aa2] [ELF,test] Improve tombstone value tests
git bisect good 0e6685ab1a8313cd1dc7eb3c99ff642e6c492aa2
# good: [a4d4b45aef6dbac1cead60dcba5e60939fc1656d] [ELF] relocateNonAlloc: move likely expr == R_ABS before unlikely R_SIZE. NFC
git bisect good a4d4b45aef6dbac1cead60dcba5e60939fc1656d
# good: [7030aab1d7a33c17d72eaf721c679be6ca0b073d] [gn build] Port 3ed940ac3dac
git bisect good 7030aab1d7a33c17d72eaf721c679be6ca0b073d
# good: [ab4d6cd6d14cef1a167de1aea2fe44900d1d7309] [TextAPI] Update RecordSlice attributes to follow code guidelines (#74743)
git bisect good ab4d6cd6d14cef1a167de1aea2fe44900d1d7309
# bad: [ffb2af3ed6a95a4eb55b81a3d1351d5d4bd66eb5] [SCEVExpander] Attempt to reinfer flags dropped due to CSE (#72431)
git bisect bad ffb2af3ed6a95a4eb55b81a3d1351d5d4bd66eb5
# bad: [e87f33d9ce785668223c3bcc4e06956985cccda1] [RISCV][MC] Pass MCSubtargetInfo down to shouldForceRelocation and evaluateTargetFixup. (#73721)
git bisect bad e87f33d9ce785668223c3bcc4e06956985cccda1
# first bad commit: [e87f33d9ce785668223c3bcc4e06956985cccda1] [RISCV][MC] Pass MCSubtargetInfo down to shouldForceRelocation and evaluateTargetFixup. (#73721)
@nathanchance nathanchance added [BUG] Untriaged Something isn't working [ARCH] risc-v This bug impacts ARCH=riscv [BUG] llvm (main) A bug in an unreleased version of LLVM (this label is appropriate for regressions) labels Dec 11, 2023
@topperc
Copy link

topperc commented Dec 11, 2023

Maybe a patch to the compiler like this can give some clue

diff --git a/llvm/lib/Target/RISCV/MCTargetDesc/RISCVAsmBackend.cpp b/llvm/lib/Target/RISCV/MCTargetDesc/RISCVAsmBackend.cpp
index 716fb67c5824..5f6a0dd6eac2 100644
--- a/llvm/lib/Target/RISCV/MCTargetDesc/RISCVAsmBackend.cpp
+++ b/llvm/lib/Target/RISCV/MCTargetDesc/RISCVAsmBackend.cpp
@@ -129,6 +129,9 @@ bool RISCVAsmBackend::shouldForceRelocation(const MCAssembler &Asm,
     return true;
   }
 
+  if (this->STI.hasFeature(RISCV::FeatureRelax) && !STI->hasFeature(RISCV::FeatureRelax))
+    assert(0);
+
   return STI->hasFeature(RISCV::FeatureRelax) || ForceRelocs;
 }
 

@nathanchance
Copy link
Member Author

That does trigger it seems.

clang: /home/nathan/cbl/src/llvm-project/llvm/lib/Target/RISCV/MCTargetDesc/RISCVAsmBackend.cpp:133: virtual bool llvm::RISCVAsmBackend::shouldForceRelocation(const MCAssembler &, const MCFixup &, const MCValue &, const MCSubtargetInfo *): Assertion `0' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: /mnt/nvme/tmp/build/llvm-bisect/final/bin/clang --target=riscv64-linux-gnu -fintegrated-as -Werror=unknown-warning-option -Werror=ignored-optimization-argument -Werror=option-ignored -Werror=unused-command-line-argument -fmacro-prefix-map=/home/nathan/cbl/src/linux/= -std=gnu11 -fshort-wchar -funsigned-char -fno-common -fno-PIE -fno-strict-aliasing -mabi=lp64 -march=rv64imac_zihintpause -mno-save-restore -mcmodel=medany -fno-asynchronous-unwind-tables -fno-unwind-tables -mstrict-align -fno-delete-null-pointer-checks -O2 -fstack-protector-strong -fno-omit-frame-pointer -fno-optimize-sibling-calls -ftrivial-auto-var-init=zero -fstrict-flex-arrays=3 -fno-strict-overflow -fno-stack-check -Wall -Wundef -Werror=implicit-function-declaration -Werror=implicit-int -Werror=return-type -Werror=strict-prototypes -Wno-format-security -Wno-trigraphs -Wno-frame-address -Wno-address-of-packed-member -Wframe-larger-than=2048 -Wno-gnu -Wno-unused-but-set-variable -Wno-unused-const-variable -Wvla -Wno-pointer-sign -Wcast-function-type -Wimplicit-fallthrough -Werror=date-time -Werror=incompatible-pointer-types -Wenum-conversion -Wno-unused-but-set-variable -Wno-unused-const-variable -Wno-format-overflow -Wno-format-truncation -Wno-pointer-to-enum-cast -Wno-tautological-constant-out-of-range-compare -Wno-unaligned-access -Wno-cast-function-type-strict -Wno-missing-field-initializers -Wno-type-limits -Wno-shift-negative-value -Wno-initializer-overrides -Wno-sign-compare -nostdinc -I/home/nathan/cbl/src/linux/arch/riscv/include -I./arch/riscv/include/generated -I/home/nathan/cbl/src/linux/include -I./include -I/home/nathan/cbl/src/linux/arch/riscv/include/uapi -I./arch/riscv/include/generated/uapi -I/home/nathan/cbl/src/linux/include/uapi -I./include/generated/uapi -include /home/nathan/cbl/src/linux/include/linux/compiler-version.h -include /home/nathan/cbl/src/linux/include/linux/kconfig.h -include /home/nathan/cbl/src/linux/include/linux/compiler_types.h -D__KERNEL__ -DCONFIG_PAGE_OFFSET=0xff60000000000000 -I /home/nathan/cbl/src/linux/mm -I ./mm -DKBUILD_MODFILE=\"mm/show_mem\" -DKBUILD_BASENAME=\"show_mem\" -DKBUILD_MODNAME=\"show_mem\" -D__KBUILD_MODNAME=kmod_show_mem -c -Wp,-MMD,mm/.show_mem.o.d -fcolor-diagnostics -o mm/show_mem.o /home/nathan/cbl/src/linux/mm/show_mem.c
1.      <eof> parser at end of file
2.      Code generation
 #0 0x000055b6223b03a8 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x39d83a8)
 #1 0x000055b6223adfde llvm::sys::RunSignalHandlers() (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x39d5fde)
 #2 0x000055b622332616 CrashRecoverySignalHandler(int) CrashRecoveryContext.cpp:0:0
 #3 0x00007fe865e7c710 (/usr/lib/libc.so.6+0x3e710)
 #4 0x00007fe865ecc83c (/usr/lib/libc.so.6+0x8e83c)
 #5 0x00007fe865e7c668 gsignal (/usr/lib/libc.so.6+0x3e668)
 #6 0x00007fe865e644b8 abort (/usr/lib/libc.so.6+0x264b8)
 #7 0x00007fe865e643dc (/usr/lib/libc.so.6+0x263dc)
 #8 0x00007fe865e74d26 (/usr/lib/libc.so.6+0x36d26)
 #9 0x000055b6212a0e4a (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x28c8e4a)
#10 0x000055b622118381 llvm::MCAssembler::evaluateFixup(llvm::MCAsmLayout const&, llvm::MCFixup const&, llvm::MCFragment const*, llvm::MCValue&, llvm::MCSubtargetInfo const*, unsigned long&, bool&) const (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x3740381)
#11 0x000055b62211af21 llvm::MCAssembler::layout(llvm::MCAsmLayout&) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x3742f21)
#12 0x000055b62211b6c6 llvm::MCAssembler::Finish() (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x37436c6)
#13 0x000055b62213bdd5 llvm::MCELFStreamer::finishImpl() (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x3763dd5)
#14 0x000055b6231cb9d4 llvm::AsmPrinter::doFinalization(llvm::Module&) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x47f39d4)
#15 0x000055b621ee087a llvm::FPPassManager::doFinalization(llvm::Module&) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x350887a)
#16 0x000055b621ed94a0 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x35014a0)
#17 0x000055b622b43415 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream>>, clang::BackendConsumer*) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x416b415)
#18 0x000055b622b665a1 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x418e5a1)
#19 0x000055b623e511c6 clang::ParseAST(clang::Sema&, bool, bool) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x54791c6)
#20 0x000055b622f65f4f clang::FrontendAction::Execute() (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x458df4f)
#21 0x000055b622ed6c5d clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x44fec5d)
#22 0x000055b623030778 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x4658778)
#23 0x000055b62112acf2 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x2752cf2)
#24 0x000055b62112713d ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&, llvm::ToolContext const&) driver.cpp:0:0
#25 0x000055b622d32a99 void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const::$_0>(long) Job.cpp:0:0
#26 0x000055b622332396 llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x395a396)
#27 0x000055b622d321a2 clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x435a1a2)
#28 0x000055b622ced307 clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&, bool) const (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x4315307)
#29 0x000055b622ced847 clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&, bool) const (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x4315847)
#30 0x000055b622d0d839 clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x4335839)
#31 0x000055b6211265f6 clang_main(int, char**, llvm::ToolContext const&) (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x274e5f6)
#32 0x000055b621137401 main (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x275f401)
#33 0x00007fe865e65cd0 (/usr/lib/libc.so.6+0x27cd0)
#34 0x00007fe865e65d8a __libc_start_main (/usr/lib/libc.so.6+0x27d8a)
#35 0x000055b6211236e5 _start (/mnt/nvme/tmp/build/llvm-bisect/final/bin/clang+0x274b6e5)
clang: error: clang frontend command failed with exit code 134 (use -v to see invocation)
ClangBuiltLinux clang version 18.0.0git (https://github.com/llvm/llvm-project f58f089164ade240e1bcb60b7c3cead840a33e1c)
Target: riscv64-unknown-linux-gnu
Thread model: posix
InstalledDir: /mnt/nvme/tmp/build/llvm-bisect/final/bin
clang: note: diagnostic msg:
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang: note: diagnostic msg: /tmp/show_mem-f5034f.c
clang: note: diagnostic msg: /tmp/show_mem-f5034f.sh
clang: note: diagnostic msg:

********************

@topperc
Copy link

topperc commented Dec 11, 2023

@nathanchance can you attach the files from the crash report?

@nathanchance
Copy link
Member Author

nathanchance commented Dec 11, 2023

@topperc Here's one of the preprocessed files from the build system directly with a simplified set of flags.

tcp.i.txt

$ clang --target=riscv64-linux-gnu -O2 -c -o /dev/null tcp.i
...
clang: /home/nathan/cbl/src/llvm-project/llvm/lib/Target/RISCV/MCTargetDesc/RISCVAsmBackend.cpp:133: virtual bool llvm::RISCVAsmBackend::shouldForceRelocation(const MCAssembler &, const MCFixup &, const MCValue &, const MCSubtargetInfo *): Assertion `0' failed.
...

@topperc
Copy link

topperc commented Dec 12, 2023

The particular place that's failing is

.Ltmp76:
        .option push
        .option norvc
        .option norelax
        j       .LBB37_5
        .option pop

We no longer emit a relocation for this jump because of the norelax. We still need to emit a relocation so the jump target can be resolved when we relax everything else.

An easy fix is to OR this->STI.hasFeature(RISCV::FeatureRelax) with the STI->hasFeature(RISCV::FeatureRelax).

@MaskRay what are your thoughts?

@MaskRay
Copy link
Member

MaskRay commented Dec 12, 2023

tl;dr I suspect that the kernel makes a brittle assumption but I am not familiar with the kernel enough to find it.

@topperc Here's one of the preprocessed files from the build system directly with a simplified set of flags.

tcp.i.txt

The reduced example is:

cat > tcp.ll <<'eof'
target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128"
target triple = "riscv64-unknown-linux-gnu"

define i32 @tcp_orphan_count_sum() {
entry:
  callbr void asm sideeffect "886 :\0A.option push\0A.option norvc\0A.option norelax\0Aj ${0:l}\0A.option pop\0A887 :\0A.if 1 == 1\0A.pushsection .alternative, \22a\22\0A.4byte\09((886b) - .) \0A.4byte\09((888f) - .) \0A.2byte\090\0A.2byte 889f - 888f\0A.4byte\0930\0A.popsection\0A.subsection 1\0A888 :\0A.option push\0A.option norvc\0A.option norelax\0Anop\0A.option pop\0A889 :\0A.org\09. - (887b - 886b) + (889b - 888b)\0A.org\09. - (889b - 888b) + (887b - 886b)\0A.previous\0A.endif\0A", "!i"()
          to label %do.body [label %do.body]

do.body:                                          ; preds = %entry, %entry
  ret i32 0
}
eof
/tmp/Rel/bin/llc -filetype=obj -mtriple=riscv64-linux-gnu -mattr=+relax,+c tcp.ll -o onepass.o
/tmp/Rel/bin/llc -mtriple=riscv64-linux-gnu -mattr=+relax,+c tcp.ll -o tcp.s && /tmp/Rel/bin/llvm-mc -filetype=obj -triple=riscv64-linux-gnu -mattr=+relax,+c tcp.s -o tcp.o

onepass.o (MCObjectStreamer) and tcp.o (MCAsmStreamer+assembler) differ in one relocation R_RISCV_JAL.
tcp.s looks like the following (with simplification).

.Ltmp0:
	.option	push
	.option	norvc
	.option	norelax
	j	.LBB0_1
	.option	pop
.Ltmp1:
	.section	.alternative,"a",@progbits
.Ltmp2:
	.word	.Ltmp0-.Ltmp2
.Ltmp4:
	.word	.Ltmp3-.Ltmp4
	.half	0
	.half	.Ltmp5-.Ltmp3
	.word	30
	.text

	.text	1
...
	.text
.LBB0_1:                                # Block address taken
	li	a0, 0
	ret

In tcp.o, j .LBB0_1 leads to a R_RISCV_JAL due to setForceRelocs==true. It's set but not cleared (from the original .option relax patch https://reviews.llvm.org/D46423). I feel nervous of the behavior and we should fix it in the future.

Ideally j .LBB0_1 should not lead to a R_RISCV_JAL. Even with the subsection changes, j .LBB0_1 can be determined at assembly-time without a relocation.
The MCObjectStreamer code path does omits the relocation. I think the new Clang behavior is correct and I suspect the kernel is making some brittle assumption.

Unfortunately, my attempt to force j %l[legacy] to emit a R_RISCV_JAL relocation does not fix the boot problem. The problem is elsewhere.

diff --git a/arch/riscv/include/asm/bitops.h b/arch/riscv/include/asm/bitops.h
index 224b4dc02b50..80af1e766751 100644
--- a/arch/riscv/include/asm/bitops.h
+++ b/arch/riscv/include/asm/bitops.h
@@ -36,9 +36,11 @@
 #endif

+#define JUMP_TO_LEGACY ".reloc ., R_RISCV_JAL, %l[legacy]; .word 0x0000006f"
+
 static __always_inline unsigned long variable__ffs(unsigned long word)
 {
        int num;

-       asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
+       asm_volatile_goto(ALTERNATIVE(JUMP_TO_LEGACY, "nop", 0,
                                      RISCV_ISA_EXT_ZBB, 1)
                          : : : : legacy);
@@ -96,5 +98,5 @@ static __always_inline unsigned long variable__fls(unsigned long word)
        int num;

-       asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
+       asm_volatile_goto(ALTERNATIVE(JUMP_TO_LEGACY, "nop", 0,
                                      RISCV_ISA_EXT_ZBB, 1)
                          : : : : legacy);
@@ -155,5 +157,5 @@ static __always_inline int variable_ffs(int x)
                return 0;

-       asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
+       asm_volatile_goto(ALTERNATIVE(JUMP_TO_LEGACY, "nop", 0,
                                      RISCV_ISA_EXT_ZBB, 1)
                          : : : : legacy);
@@ -210,5 +212,5 @@ static __always_inline int variable_fls(unsigned int x)
                return 0;

-       asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
+       asm_volatile_goto(ALTERNATIVE(JUMP_TO_LEGACY, "nop", 0,
                                      RISCV_ISA_EXT_ZBB, 1)
                          : : : : legacy);

(
It seems that GNU assembler cannot optimize out R_RISCV_JAL.
LLVM integrated assembler is able to optimize out R_RISCV_JAL and after the recent [RISCV][MC] Pass MCSubtargetInfo down to shouldForceRelocation and evaluateTargetFixup. change able to optimize out more.
I think the direction is correct.

% cat y.s
j .Ltmp1
j .Ltmp1
.Ltmp1:
% clang --target=riscv64-linux-gnu -march=rv64i -mno-relax -c -fno-integrated-as y.s && llvm-objdump -dr y.o

y.o:    file format elf64-littleriscv

Disassembly of section .text:

0000000000000000 <.text>:
       0: 6f 00 80 00   j       0x8 <.Ltmp1>
                0000000000000000:  R_RISCV_JAL  .Ltmp1
       4: 6f 00 40 00   j       0x8 <.Ltmp1>
                0000000000000004:  R_RISCV_JAL  .Ltmp1

)

@nathanchance
Copy link
Member Author

tl;dr I suspect that the kernel makes a brittle assumption but I am not familiar with the kernel enough to find it.

It is possible that there is an assumption around R_RISCV_JAL in the alternative patching code (https://elixir.bootlin.com/linux/v6.7-rc6/source/arch/riscv/kernel/alternative.c#L133), which came from https://git.kernel.org/linus/9d5567ccf96fb2f1eb83d531eee23ead4aa8f2a3 as a result of https://git.kernel.org/linus/27c653c06505f084bcb57f7575916d60efb32279. I spent some time today debugging in gdb and it seems like it is related to variable_fls(), as I see that in the call path before everything goes haywire. I still have not figured out where exactly it is crashing but 0xff6000001ffe9000 is the address of the ai parameter to pcpu_setup_first_chunk(), which is 0x40 off from the crashing address of 0xff6000001ffe9040.

I will note those two commits came into the kernel relatively recently but I see boot problems in 5.10, which is quite old... I do see a jal in the jump label code (https://elixir.bootlin.com/linux/v6.7-rc6/source/arch/riscv/include/asm/jump_label.h#L47), which could also be related.

@nathanchance
Copy link
Member Author

I spent some time today debugging in gdb and it seems like it is related to variable_fls(), as I see that in the call path before everything goes haywire.

Okay "goes haywire" may have been underselling it. Now that I have compared what happens in gdb on the good vs. the bad kernel, it is obvious what the problem is but not necessarily why (at least to me, although I may be missing something obvious, since my brain is somewhat fried...): we appear to skip over a section of pcpu_setup_first_chunk() and jump into pcpu_alloc_first_chunk(), which clearly should not happen.

On the good kernel (in pcpu_setup_first_chunk()):

(gdb)
2738            pcpu_sidelined_slot = __pcpu_size_to_slot(pcpu_unit_size) + 1;
(gdb) s
__pcpu_size_to_slot (size=81920) at /home/nathan/cbl/src/linux/mm/percpu.c:231
231             int highbit = fls(size);        /* size is in bytes */
(gdb) s
variable_fls (x=81920) at /home/nathan/cbl/src/linux/arch/riscv/include/asm/bitops.h:209
209             if (!x)
(gdb) s
212             asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
(gdb) s
216             asm volatile (".option push\n"
(gdb) s
222             return 32 - r;
(gdb) s
__pcpu_size_to_slot (size=<optimized out>) at /home/nathan/cbl/src/linux/mm/percpu.c:232
232             return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
(gdb) s
pcpu_setup_first_chunk (ai=0xff6000001ffe9000, base_addr=0xff6000001ffd4000) at /home/nathan/cbl/src/linux/mm/percpu.c:2738
2738            pcpu_sidelined_slot = __pcpu_size_to_slot(pcpu_unit_size) + 1;
(gdb) s
2739            pcpu_free_slot = pcpu_sidelined_slot + 1;
(gdb) s
2740            pcpu_to_depopulate_slot = pcpu_free_slot + 1;
(gdb) p pcpu_sidelined_slot
$1 = 15

On the bad kernel:

(gdb)
2738            pcpu_sidelined_slot = __pcpu_size_to_slot(pcpu_unit_size) + 1;
(gdb) s
__pcpu_size_to_slot (size=81920) at /home/nathan/cbl/src/linux/mm/percpu.c:231
231             int highbit = fls(size);        /* size is in bytes */
(gdb)
variable_fls (x=81920) at /home/nathan/cbl/src/linux/arch/riscv/include/asm/bitops.h:209
209             if (!x)
(gdb)
212             asm_volatile_goto(ALTERNATIVE("j %l[legacy]", "nop", 0,
(gdb)
pcpu_alloc_first_chunk (tmp_addr=18401708077972664640, map_size=536690688) at /home/nathan/cbl/src/linux/mm/percpu.c:1357
1357            region_size = ALIGN(start_offset + map_size, PAGE_SIZE);

I have uploaded the object files of mm/percpu.o between the good and bad compilers for further introspection if necessary:

https://gist.github.com/nathanchance/7e22ced9a7bca1c69a372ca11f822104

@nathanchance
Copy link
Member Author

In continuing to try and debug this further, I have noticed two things.

  • One runtime difference between the good and the bad toolchain is ZBB is detected and patched via alternatives when the good toolchain builds the kernel but not the bad toolchain, which I can see by adding a log statement in if (!__riscv_isa_extension_available(NULL, id)) (id=30 is RISCV_ISA_EXT_ZBB) but I am not sure why that change would occur...
  • On 5.15, the panic happens even earlier and it seems to be related to jump labels because things go awry after a static key gets increased (I see static_branch_inc(&sched_clock_running); called sched_clock_init() then it dies in generic_sched_clock_init() after hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL_HARD); with
[    0.011199] Unable to handle kernel paging request at virtual address 0000000001000004
[    0.014362] Oops [#1]
[    0.014453] Modules linked in:
[    0.014981] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.144-00001-g19b4e293c319 #1
[    0.015120] Hardware name: riscv-virtio,qemu (DT)
[    0.015235] epc : do_raw_spin_lock+0xa/0x1e
[    0.015824]  ra : _raw_spin_lock+0x1a/0x22
[    0.015859] epc : ffffffff80050778 ra : ffffffff80649ff6 sp : ffffffff81403ef0
[    0.015892]  gp : ffffffff814de0a8 tp : ffffffff8140bd80 t0 : 0000000000000000
[    0.015924]  t1 : 0000000000000001 t2 : 0000000000000400 s0 : ffffffff81403f10
[    0.015955]  s1 : ffffffe01fde8bc0 a0 : 0000000001000000 a1 : 0000000000000002
[    0.015987]  a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000004
[    0.016018]  a5 : 0000000000000000 a6 : 0000000000000000 a7 : 0000000052464e43
[    0.016048]  s2 : 0000000200000020 s3 : ffffffff815113e8 s4 : 0000000000000009
[    0.016240]  s5 : ffffffe01fde8bc0 s6 : ffffffe01fde8c40 s7 : 0000000000000000
[    0.016278]  s8 : ffffffff814857c0 s9 : ffffffe01fde8bc9 s10: ffffffe01fde8c49
[    0.016312]  s11: 0000000000000000 t3 : 00000000000002e2 t4 : 0000000000000000
[    0.016345]  t5 : 00000000000002d8 t6 : 0000000000000000
[    0.016375] status: 0000000200000100 badaddr: 0000000001000004 cause: 000000000000000d
[    0.016477] [<ffffffff80050778>] do_raw_spin_lock+0xa/0x1e
[    0.016564] [<ffffffff80649ff6>] _raw_spin_lock+0x1a/0x22
[    0.016592] [<ffffffff8006f67e>] .LBB10_17+0x36/0x68
[    0.016633] [<ffffffff8080ab5a>] .Lpcrel_hi13+0x18/0x24
[    0.016661] [<ffffffff80807684>] .Lpcrel_hi1+0x18/0x24
[    0.016687] [<ffffffff80800584>] .LBB19_12+0x4/0x72
[    0.017437] ---[ end trace fe60ea4398d6ec42 ]---
[    0.017671] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.017844] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

@nathanchance
Copy link
Member Author

I can avoid this issue on 5.15 by disabling CONFIG_JUMP_LABEL. That does not work for the issue on mainline, which aligns with my theory that if this is a poor assumption by the kernel, it is likely in both the jump label code and the alternatives code. Disabling alternatives appears to be rather difficult for ARCH=riscv from what I can tell.

@nickdesaulniers
Copy link
Member

cc @hiraditya as rolling past llvm/llvm-project@e87f33d will likely break RISCV linux kernel from booting for Android.

@nathanchance
Copy link
Member Author

rolling past llvm/llvm-project@e87f33d will likely break RISCV linux kernel from booting for Android.

Confirmed, ARCH=riscv gki_defconfig is broken: https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/7443861675/job/20250326516

@ebiggers
Copy link

ebiggers commented Jan 9, 2024

This isn't caused by the Linux kernel's alternatives patching, as the crash occurs prior to that. I've put together a standalone reproducer:

file1.c:

extern void f(void);

int main()
{
        asm goto(
                 ".option push\n"
                 ".option norvc\n"
                 ".option norelax\n"
                 "j %[label]\n"
                 ".option pop\n" :::: label);

        f();
        f();
        f();
        f();
label:
        return 0;
}

file2.c:

void f(void)
{
}

commands:

clang -target riscv64-linux-gnu -O2 -c file1.c
clang -target riscv64-linux-gnu -O2 -c file2.c
ld.lld file1.o file2.o
llvm-objdump -trd file1.o
llvm-objdump -d a.out

The object file file1.o has the correct jump from address 0x4 to address 0x28:

0000000000000000 <main>:
       0: 41 11        	addi	sp, sp, -16
       2: 06 e4        	sd	ra, 8(sp)
       4: 6f 00 40 02  	j	0x28 <main+0x28>
       8: 97 00 00 00  	auipc	ra, 0
		0000000000000008:  R_RISCV_CALL_PLT	f
		0000000000000008:  R_RISCV_RELAX	*ABS*
       c: e7 80 00 00  	jalr	ra
      10: 97 00 00 00  	auipc	ra, 0
		0000000000000010:  R_RISCV_CALL_PLT	f
		0000000000000010:  R_RISCV_RELAX	*ABS*
      14: e7 80 00 00  	jalr	ra
      18: 97 00 00 00  	auipc	ra, 0
		0000000000000018:  R_RISCV_CALL_PLT	f
		0000000000000018:  R_RISCV_RELAX	*ABS*
      1c: e7 80 00 00  	jalr	ra
      20: 97 00 00 00  	auipc	ra, 0
		0000000000000020:  R_RISCV_CALL_PLT	f
		0000000000000020:  R_RISCV_RELAX	*ABS*
      24: e7 80 00 00  	jalr	ra
      28: 01 45        	li	a0, 0
      2a: a2 60        	ld	ra, 8(sp)
      2c: 41 01        	addi	sp, sp, 16
      2e: 82 80        	ret

But after linking, the jump is jumping off the end of the function:

00000000000111a0 <main>:
   111a0: 41 11        	addi	sp, sp, -16
   111a2: 06 e4        	sd	ra, 8(sp)
   111a4: 6f 00 40 02  	j	0x111c8 <f+0x8>
   111a8: ef 00 80 01  	jal	0x111c0 <f>
   111ac: ef 00 40 01  	jal	0x111c0 <f>
   111b0: ef 00 00 01  	jal	0x111c0 <f>
   111b4: ef 00 c0 00  	jal	0x111c0 <f>
   111b8: 01 45        	li	a0, 0
   111ba: a2 60        	ld	ra, 8(sp)
   111bc: 41 01        	addi	sp, sp, 16
   111be: 82 80        	ret

00000000000111c0 <f>:
   111c0: 82 80        	ret

The distance jumped is still 0x24, so apparently the bug is that when the code was shrunk by "relaxations", the jump distance was not decreased accordingly.

Passing --no-relax to the linker avoids the bug, as previously observed.

MaskRay added a commit to MaskRay/llvm-project that referenced this issue Jan 9, 2024
…Relax

Regarding
```
.option norelax
j label
.option relax
// relaxable instructions
label:
```

The J instruction needs a relocation to ensure the target is correct
after linker relaxation. This is related a limitation in the assembler:
RISCVAsmBackend::shouldForceRelocation decides upfront whether a
relocation is needed, instead of checking more information (whether
there are relaxable fragments in between).

Despite the limitation, `j label` produces a relocation in direct object
emission mode, but was broken by llvm#73721 due to the shouldForceRelocation
limitation.

Add a workaround to RISCVTargetELFStreamer to emulate the previous
behavior.

Link: ClangBuiltLinux/linux#1965
@nathanchance
Copy link
Member Author

I can confirm that llvm/llvm-project#77436 resolves this issue for me.

@nathanchance nathanchance added [PATCH] Submitted A patch has been submitted for review and removed [BUG] Untriaged Something isn't working labels Jan 9, 2024
MaskRay added a commit to MaskRay/llvm-project that referenced this issue Jan 9, 2024
MaskRay added a commit to llvm/llvm-project that referenced this issue Jan 9, 2024
…Relax (#77436)

Regarding
```
.option norelax
j label
.option relax
// relaxable instructions
// For assembly input, RISCVAsmParser::ParseInstruction will set ForceRelocs (https://reviews.llvm.org/D46423).
// For direct object emission, ForceRelocs is not set after #73721
label:
```

The J instruction needs a relocation to ensure the target is correct
after linker relaxation. This is related a limitation in the assembler:
RISCVAsmBackend::shouldForceRelocation decides upfront whether a
relocation is needed, instead of checking more information (whether
there are relaxable fragments in between).

Despite the limitation, `j label` produces a relocation in direct object
emission mode, but was broken by #73721 due to the shouldForceRelocation
limitation.

Add a workaround to RISCVTargetELFStreamer to emulate the previous
behavior.

Link: ClangBuiltLinux/linux#1965
@MaskRay
Copy link
Member

MaskRay commented Jan 9, 2024

@ebiggers Thanks for the detailed reproduce, @nathanchance for testing and @topperc for approving llvm/llvm-project#77436 .

This issue has been resolved by llvm/llvm-project#77436

@nathanchance nathanchance added [FIXED][LLVM] main This bug was only present and fixed in an unreleased version of LLVM and removed [PATCH] Submitted A patch has been submitted for review labels Jan 9, 2024
justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this issue Jan 28, 2024
…Relax (llvm#77436)

Regarding
```
.option norelax
j label
.option relax
// relaxable instructions
// For assembly input, RISCVAsmParser::ParseInstruction will set ForceRelocs (https://reviews.llvm.org/D46423).
// For direct object emission, ForceRelocs is not set after llvm#73721
label:
```

The J instruction needs a relocation to ensure the target is correct
after linker relaxation. This is related a limitation in the assembler:
RISCVAsmBackend::shouldForceRelocation decides upfront whether a
relocation is needed, instead of checking more information (whether
there are relaxable fragments in between).

Despite the limitation, `j label` produces a relocation in direct object
emission mode, but was broken by llvm#73721 due to the shouldForceRelocation
limitation.

Add a workaround to RISCVTargetELFStreamer to emulate the previous
behavior.

Link: ClangBuiltLinux/linux#1965
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[ARCH] risc-v This bug impacts ARCH=riscv [BUG] llvm (main) A bug in an unreleased version of LLVM (this label is appropriate for regressions) [FIXED][LLVM] main This bug was only present and fixed in an unreleased version of LLVM
Projects
None yet
Development

No branches or pull requests

5 participants