-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unprofitable SCEV/INDVARS transform #56153
Comments
@llvm/issue-subscribers-backend-arm |
@llvm/issue-subscribers-backend-x86 |
FWIW, backtrace looks like:
so |
Having a backedge expression result in a udiv is pretty fundamental. That's the core goal of SCEV's design. We could potentially exploring making RLEV cost based, but we've talked about that before, and we have pretty solid reasons for wanting to treat it as a canonical form. On the target which doesn't have udiv, can you give a feeling for the relative cost of the libcall and loop? Are we talking small constant factor differences or much more extreme? I'm basically wondering if this is a costing question, or a legality one in disguise. |
The compiler-rt routines for division are not very well optimized; we're probably talking about hundreds of cycles if we're using the 64-bit target-independent divide routine on a 32-bit target. We could probably do better, but still pretty expensive in any case. For x86 specifically, we do have an optimized assembly routine for __udivdi3, though. We also don't have any compiler optimizations for 64-bit divide by constant on 32-bit targets; we could probably emit inline code sequences that are a lot faster than the generic routine in many cases. (For division by 12, Arm gcc emits an inline sequence of roughly 20 instructions.) But given the code sequences the compiler currently emits, we really don't want indvars to emit division in any case where it would be lowered to a libcall (not power of two, no legal instruction, can't use divide->multiply transform). |
This is becoming a more pressing issue for us, particularly after 3bc09c7 as it's now going to start blocking our CI. cc @nathanchance |
Specifically for 32b arm, 64b division as implemented by |
Any suggestion on existing machinery to check for this? SCEVExpander::isHighCostExpansion might be a good place to do this check, but LoopUtils::rewriteLoopExitValues() will still override the results of |
What kind of cost does the udiv get for this target? Possibly we can make LoopCanBeDel still use HighCost and just increase the budget by a lot? The normal budget is 4. |
Nick and I chatted a bit offline, here is my summary. Motivation here is that kernel essentially has a dialect of C which disallows 64 bit divides on 32 bit platforms. There is a kernel function which provides a 64 bit divide if needed, but the kernel expects the compiler not to insert calls to runtime routines for this purpose. Nick has a patch which implemented the runtime function in terms of the kernel function; something about that didn't work, I didn't entirely grasp what. RLEV is a canonicalization. Turning it off impacts other transformations like e.g. LoopVectorizer. We need to be careful to make sure that any negative impact is restrict to kernel code, and not generic user land code for the platform. This concern makes it hard to use the cost model approach. We can introduce a high cost threshold for the dead loop case, but how to we arrange for the udiv to have infinite cost in kernel code only? We don't currently have a good answer to that. I suggested two near term approaches. Both of these essentially punt on the root issue, and instead focus on simply getting rid of these particular udivs.
|
Right, seems like we might be missing this. |
We're specifically missing the "divide udword by dword" part, yes. I think https://gmplib.org/~tege/division-paper.pdf is the current state of the art. See also https://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html , https://0x414b.com/2021/04/16/arm-division.html . |
Is |
The existing code for handling unsigned divide by constant is in TargetLowering::BuildUDIV. |
I started implementing something like what gcc does here https://reviews.llvm.org/D130862 |
From a discussion with @topperc on discourse:
ClangBuiltLinux/linux#1635 is urem/udiv on arm32 and i386 of 64b with constant 12. |
I think the backend improvements for division by constants have landed -- can this issue be closed now? |
produces
unfortunately the
udiv
we got from SCEV is going to be lowered to an relatively expensive libcall to__aeabi_uldivmod
(32b x86 is also affected, the libcall is__udivdi3
). So we may have eliminated the loop body, but we replaced it with a relatively expensive libcall.Forked from ClangBuiltLinux/linux#1635. ClangBuiltLinux/linux#1635 (comment) has the reduced C code, and demonstrates GCC keeping the tight loop:
llvm::rewriteLoopExitValues
in llvm/lib/Transforms/Utils/LoopUtils.cpp tries to replace the loop body regardless of cost if the loop can be deleted. I'm not sure if we could exclude such transforms if the SCEV contains a udiv larger than the target word size, or if the target doesn't support divisionARMSubTarget::hasDivideInARMMode()
, or something more generic?See also https://reviews.llvm.org/D9800 / e2538b5,
cc @LebedevRI @fhahn @sjoerdmeijer @wmi-11 @preames @atrick
The text was updated successfully, but these errors were encountered: