clang 15 built kernel crashes w. "BUG: kernel NULL pointer dereference, address: 00000000", gcc 12 built kernel with same config boots fine (6.1-rc7, x86_32) #1766
Comments
Based on the stack trace, it seems like the maple tree is involved. Liam Howlett, the maple tree maintainer, has been pretty responsive to bug reports from what I can tell:
It would be interesting to see if this is reproducible in a virtual machine, which would make debugging it simpler.
We have occasionally had issues that turned out to be kernel bugs due to UB or other subtleties that only show up with clang. |
I've spent the last few days recreating this bug and have finally concluded that it is a clang-15 bug. As necessary background on the call stack from the crash: we are under the mmap_lock, which means only the write operation is occurring in this mm struct. The debug output uses the mm_struct pointer to confirm that it is indeed the same task printing both messages to the console. I made the following changes (among other debug outputs, so the lines won't match, but the logic is sound):
And received the following output:
Followed by the reported crash. So somehow the boolean return of 'true' is not treated as true, which results in a crash very similar to the one initially reported. I also recreated the situation in my userspace test code and it seems to work there, so I'm not sure what else is at play to cause the logic failure. Recreation required clang-15, v6.1-rc7, and the use of the provided config, although I did make modifications to have qemu/kvm reproduce the issue, so I've attached that here as well. Rebuilding with clang-14 or gcc allows the machine to boot smoothly. It is worth noting that I used Debian for ease of testing.
I attached the clang-14 config before. Here is the clang-15 config that causes the failure. Take a look at the config option CONFIG_ZERO_CALL_USED_REGS, as that seems to be one that matters. It seems to enable -fzero-call-used-regs=used-gpr.
Thanks for looking into this closer @howlett .
Makes sense that this could be This smells like llvm/llvm-project#57692. I noticed in the attached config that:
@howlett can you confirm whether your build of clang 15.0.6 contains d4bada99c069e2edbee2f4c815598476e7508f0b? Perhaps it's possible that the version of clang 15 was incremented to 15.0.6 before d4bada99c069e2edbee2f4c815598476e7508f0b landed? |
Confirmed, testing was done with 15.0.6 $ clang-15 --version Build command for the kernel: |
I can see the bug in the disassembly:
So Simply having a function return 1 and testing that with |
Just doing a command line reduction, I needed at least |
Initial reduction:
// clang -m32 -fno-pic -march=i686 -O2 -fzero-call-used-regs=used-gpr -c maple_tree.i -S -o -
struct maple_arange_64 {
long pivot[1];
};
enum maple_type { maple_arange_64 };
struct {
struct {
struct maple_arange_64 ma64;
};
} * mas_data_end___trans_tmp_2;
char mt_slots_0, mt_pivots_0, ma_meta_end_mn_0_0_0_0_0_0;
struct maple_big_node {
char b_end;
};
struct maple_subtree_state {
struct maple_big_node *bn;
} ma_is_leaf();
enum maple_type mas_data_end_type;
char mas_data_end() {
char offset;
if (mas_data_end_type)
return ma_meta_end_mn_0_0_0_0_0_0;
offset = mt_pivots_0 - 1;
if (__builtin_expect(mas_data_end___trans_tmp_2->ma64.pivot[offset], 1))
return offset;
return mt_pivots_0;
}
void mas_mab_cp(char);
_Bool mas_prev_sibling();
_Bool mas_push_data(struct maple_subtree_state *mast) {
unsigned char slot_total = mast->bn->b_end, end, space;
if (mas_prev_sibling())
end = mas_data_end();
space = mt_slots_0;
ma_is_leaf();
if (slot_total >= space)
return mast;
mas_mab_cp(end);
}
Let me see if I can reduce this further, but this disassembly very clearly has:
movb $1, %al <- store 1 to %eax
addl $24, %esp
.cfi_def_cfa_offset 8
popl %ebx
.cfi_def_cfa_offset 4
xorl %eax, %eax <- store 0 to %eax
xorl %ecx, %ecx
xorl %edx, %edx
retl |
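For readers who want to poke at the pattern outside the kernel, here is a minimal userspace sketch of the "callee returns true, caller must see true" check. It is an illustration only: `push_data` and `return_value_survives` are hypothetical names, not kernel functions, and the `zero_call_used_regs("used-gpr")` function attribute is used as a per-function stand-in for the `-fzero-call-used-regs=used-gpr` flag, guarded so that compilers without the attribute still build it. Note that the thread reports a userspace recreation did not reproduce the failure, so a passing run here proves nothing about a given compiler; the useful part is compiling with the flags from the reduction and reading the epilogue of `push_data` in the assembly.

```c
#include <assert.h>
#include <stdbool.h>

#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

/* Assumption: fall back to a plain function if the attribute is unknown. */
#if __has_attribute(zero_call_used_regs)
#define ZERO_USED_GPR __attribute__((zero_call_used_regs("used-gpr")))
#else
#define ZERO_USED_GPR
#endif

/*
 * Hypothetical stand-in for a helper that returns a bool.
 * noinline keeps a real call/return boundary so the return
 * register actually has to carry the value back to the caller.
 */
ZERO_USED_GPR __attribute__((noinline))
bool push_data(unsigned char slot_total, unsigned char space)
{
    if (slot_total >= space)
        return true;
    return false;
}

/* Returns 1 if the caller observes the values the callee returned. */
int return_value_survives(void)
{
    bool full = push_data(8, 4);      /* expected: true  */
    bool has_room = push_data(1, 4);  /* expected: false */

    return full == true && has_room == false;
}

/* Aborts via assert() if the return value was lost on the way back. */
void check_return_values(void)
{
    assert(return_value_survives() == 1);
}
```

Calling `check_return_values()` from a small `main` and building with something like the command line from the reduction above is one way to compare the generated epilogues between compiler versions.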
Yeah, that looks like grossness. I'll investigate. |
I think I know what's going on. The code uses both |
Should be fixed now. |
This bug was bad enough that I think we should consider marking ZERO_CALL_USED_REGS broken with clang-15; unless we can get upstream llvm to consider a clang 15.0.7 release for this. |
We could preemptively do something like this, which accounts for 15.0.7 existing or not:
diff --git a/security/Kconfig.hardening b/security/Kconfig.hardening
index d766b7d0ffd1..ddf9b411a3dd 100644
--- a/security/Kconfig.hardening
+++ b/security/Kconfig.hardening
@@ -257,6 +257,8 @@ config INIT_ON_FREE_DEFAULT_ON
 
 config CC_HAS_ZERO_CALL_USED_REGS
 	def_bool $(cc-option,-fzero-call-used-regs=used-gpr)
+	# https://github.com/ClangBuiltLinux/linux/issues/1766
+	depends on !CC_IS_CLANG || CLANG_VERSION > 150006
 
 config ZERO_CALL_USED_REGS
 	bool "Enable register zeroing on function exit"
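To make the `CLANG_VERSION > 150006` condition easier to read, here is a small sketch of the comparison in C. It assumes the version is encoded as 10000 * major + 100 * minor + patch, which is consistent with 15.0.6 appearing as 150006 in the patch; `encode_version` and `zero_call_used_regs_allowed` are hypothetical helper names used only for illustration, not anything in the kernel tree.

```c
#include <stdbool.h>

/* Assumed encoding: 15.0.6 -> 150006, 15.0.7 -> 150007, 16.0.0 -> 160000. */
int encode_version(int major, int minor, int patch)
{
    return 10000 * major + 100 * minor + patch;
}

/*
 * Mirrors the Kconfig expression
 *   !CC_IS_CLANG || CLANG_VERSION > 150006
 * Non-clang compilers are always allowed; clang is allowed only
 * when it is newer than 15.0.6.
 */
bool zero_call_used_regs_allowed(bool is_clang, int clang_version)
{
    return !is_clang || clang_version > 150006;
}
```

With this reading, clang 15.0.6 and older are excluded, while a hypothetical 15.0.7 and any 16.x release pass the check, which is what "accounts for 15.0.7 existing or not" refers to.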
I have sent the above patch: https://lore.kernel.org/20221214232602.4118147-1-nathan@kernel.org/ |
Just for the records and to confirm this is FIXED with LLVM
NOTE: Toolchain: https://github.com/samitolvanen/llvm-project/commits/15.x/kcfi |
This is an interesting one!
Gave 6.1-rc7 a test ride on ye goode olde Pentium 4 box and noticed that while the kernel boots just fine when built with the gcc 12 toolchain, it crashes at boot when built with the clang 15 toolchain, with the same kernel .config used.
This is reproducible and happens every time at boot on this machine;
Some data about the machine:
If you think it would be a good idea I could mail a bug report to linux-mm too.
dmesg_61-rc7_p4_clang.txt
dmesg_61-rc7_p4_gcc.txt
config_61-rc7_p4-clang.txt
config_61-rc7_p4-gcc.txt