
unprofitable SCEV/INDVARS transform #56153

Open
nickdesaulniers opened this issue Jun 21, 2022 · 17 comments
@nickdesaulniers (Member)

```llvm
target triple = "armv4t-unknown-linux-gnueabi"

@scpart_scan_partmap_offs = external global i64

define void @scpart_scan_partmap() {
entry:
  br label %while.body.preheader

while.body.preheader:                             ; preds = %entry
  br label %while.body

while.body:                                       ; preds = %while.body, %while.body.preheader
  %add13 = phi i64 [ %add, %while.body ], [ undef, %while.body.preheader ]
  %add = add nsw i64 %add13, 12
  %cmp = icmp slt i64 %add13, 0
  br i1 %cmp, label %while.body, label %while.cond.while.end_crit_edge

while.cond.while.end_crit_edge:                   ; preds = %while.body
  %add.lcssa = phi i64 [ %add13, %while.body ]
  store i64 %add.lcssa, ptr null, align 8
  ret void
}
```

```
$ opt -passes=indvars reduced.ll -S
```

produces

```llvm
source_filename = "reduced.ll"
target triple = "armv4t-unknown-linux-gnueabi"

@scpart_scan_partmap_offs = external global i64

define void @scpart_scan_partmap() {
entry:
  br label %while.body.preheader

while.body.preheader:                             ; preds = %entry
  %smax = call i64 @llvm.smax.i64(i64 undef, i64 0)
  %0 = sub i64 %smax, undef
  %umin = call i64 @llvm.umin.i64(i64 %0, i64 1)
  %1 = sub i64 %0, %umin
  %2 = udiv i64 %1, 12
  %3 = add i64 %umin, %2
  %4 = mul i64 %3, 12
  br label %while.body

while.body:                                       ; preds = %while.body, %while.body.preheader
  br i1 false, label %while.body, label %while.cond.while.end_crit_edge

while.cond.while.end_crit_edge:                   ; preds = %while.body
  %5 = add i64 %4, undef
  store i64 %5, ptr null, align 8
  ret void
}

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare i64 @llvm.smax.i64(i64, i64) #0

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare i64 @llvm.umin.i64(i64, i64) #0

attributes #0 = { nocallback nofree nosync nounwind readnone speculatable willreturn }
```

Unfortunately, the udiv we got from SCEV is going to be lowered to a relatively expensive libcall to __aeabi_uldivmod (32-bit x86 is also affected; there the libcall is __udivdi3). So while we may have eliminated the loop body, we replaced it with a relatively expensive libcall.
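As an aside, the smax/umin/udiv sequence in the preheader is SCEV's overflow-safe ceiling-division idiom (this is what getUDivCeilSCEV in the backtrace below produces). A standalone C sketch of the identity it uses — function and variable names here are mine, not LLVM's:

```c
#include <stdint.h>

/* Overflow-safe ceiling division, mirroring the expanded preheader:
 *   umin = min(a, 1);  ceil(a / b) == umin + (a - umin) / b   (for b > 0)
 * This avoids computing a + (b - 1), which could wrap for large a. */
uint64_t udiv_ceil(uint64_t a, uint64_t b) {
    uint64_t umin = a < 1 ? a : 1;  /* 0 when a == 0, else 1 */
    return umin + (a - umin) / b;
}
```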

Forked from ClangBuiltLinux/linux#1635. ClangBuiltLinux/linux#1635 (comment) has the reduced C code, and demonstrates GCC keeping the tight loop:

```
      24: 04 f0 9d a4   ldrge   pc, [sp], #4
      28: 0c 30 93 e2   adds    r3, r3, #12
      2c: 00 20 a2 e2   adc     r2, r2, #0
      30: 01 00 53 e1   cmp     r3, r1
      34: 00 c0 d2 e0   sbcs    r12, r2, r0
      38: fa ff ff ba   blt     0x28 <scpart_scan_partmap+0x28> @ imm = #-24
```

llvm::rewriteLoopExitValues in llvm/lib/Transforms/Utils/LoopUtils.cpp tries to replace the loop body regardless of cost if the loop can be deleted. I'm not sure whether we should exclude such transforms when the SCEV contains a udiv wider than the target word size, or when the target doesn't support hardware division (cf. ARMSubtarget::hasDivideInARMMode()), or whether something more generic is needed.

See also https://reviews.llvm.org/D9800 / e2538b5.

cc @LebedevRI @fhahn @sjoerdmeijer @wmi-11 @preames @atrick

@llvmbot (Member)

llvmbot commented Jun 21, 2022

@llvm/issue-subscribers-backend-arm

@llvmbot (Member)

llvmbot commented Jun 21, 2022

@llvm/issue-subscribers-backend-x86

@nickdesaulniers (Member, Author)

FWIW, the backtrace looks like:

```
#0  0x000000000296c460 in llvm::ScalarEvolution::getUDivExpr(llvm::SCEV const*, llvm::SCEV const*) ()
#1  0x00000000029850ae in llvm::ScalarEvolution::getUDivCeilSCEV(llvm::SCEV const*, llvm::SCEV const*) ()
#2  0x00000000029934be in llvm::ScalarEvolution::howManyLessThans(llvm::SCEV const*, llvm::SCEV const*, llvm::Loop const*, bool, bool, bool) ()
#3  0x000000000298e4b7 in llvm::ScalarEvolution::computeExitLimitFromICmp(llvm::Loop const*, llvm::CmpInst::Predicate, llvm::SCEV const*, llvm::SCEV const*, bool, bool) ()
#4  0x000000000298e144 in llvm::ScalarEvolution::computeExitLimitFromICmp(llvm::Loop const*, llvm::ICmpInst*, bool, bool, bool) ()
#5  0x000000000298d66c in llvm::ScalarEvolution::computeExitLimitFromCondImpl(llvm::ScalarEvolution::ExitLimitCache&, llvm::Loop const*, llvm::Value*, bool, bool, bool) ()
#6  0x000000000298d283 in llvm::ScalarEvolution::computeExitLimitFromCondCached(llvm::ScalarEvolution::ExitLimitCache&, llvm::Loop const*, llvm::Value*, bool, bool, bool) ()
#7  0x000000000298ca67 in llvm::ScalarEvolution::computeExitLimit(llvm::Loop const*, llvm::BasicBlock*, bool) ()
#8  0x000000000298868e in llvm::ScalarEvolution::computeBackedgeTakenCount(llvm::Loop const*, bool) ()
#9  0x0000000002987134 in llvm::ScalarEvolution::getBackedgeTakenInfo(llvm::Loop const*) ()
#10 0x0000000002980992 in llvm::ScalarEvolution::getRangeRef(llvm::SCEV const*, llvm::ScalarEvolution::RangeSignHint) ()
#11 0x0000000002971779 in StrengthenNoWrapFlags(llvm::ScalarEvolution*, llvm::SCEVTypes, llvm::ArrayRef<llvm::SCEV const*>, llvm::SCEV::NoWrapFlags) ()
#12 0x00000000029640d0 in llvm::ScalarEvolution::getAddExpr(llvm::SmallVectorImpl<llvm::SCEV const*>&, llvm::SCEV::NoWrapFlags, unsigned int) ()
#13 0x0000000002976af9 in llvm::ScalarEvolution::createSCEV(llvm::Value*) ()
#14 0x000000000296ecd9 in llvm::ScalarEvolution::getSCEV(llvm::Value*) ()
#15 0x0000000003b1d1a1 in llvm::simplifyUsersOfIV(llvm::PHINode*, llvm::ScalarEvolution*, llvm::DominatorTree*, llvm::LoopInfo*, llvm::TargetTransformInfo const*, llvm::SmallVectorImpl<llvm::WeakTrackingVH>&, llvm::SCEVExpander&, llvm::IVVisitor*) ()
#16 0x00000000036cff3d in (anonymous namespace)::IndVarSimplify::run(llvm::Loop*) ()
```

so llvm::ScalarEvolution::howManyLessThans can end up inserting divisions.

@preames (Collaborator)

preames commented Jun 21, 2022

Having a backedge expression result in a udiv is pretty fundamental. That's the core goal of SCEV's design.

We could potentially explore making RLEV cost-based, but we've talked about that before, and we have pretty solid reasons for wanting to treat it as a canonical form.

On a target which doesn't have a hardware udiv, can you give a feeling for the relative cost of the libcall and the loop? Are we talking small constant-factor differences or something much more extreme? I'm basically wondering whether this is a costing question, or a legality one in disguise.

@efriedma-quic (Collaborator)

The compiler-rt routines for division are not very well optimized; we're probably talking about hundreds of cycles if we're using the 64-bit target-independent divide routine on a 32-bit target. We could probably do better, but still pretty expensive in any case. For x86 specifically, we do have an optimized assembly routine for __udivdi3, though.

We also don't have any compiler optimizations for 64-bit divide by constant on 32-bit targets; we could probably emit inline code sequences that are a lot faster than the generic routine in many cases. (For division by 12, Arm gcc emits an inline sequence of roughly 20 instructions.)

But given the code sequences the compiler currently emits, we really don't want indvars to emit division in any case where it would be lowered to a libcall (not power of two, no legal instruction, can't use divide->multiply transform).
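For reference, here is what the divide-to-multiply transform mentioned above looks like for 64-bit division by 12. This is a hedged C sketch of the principle: it uses the GCC/Clang `unsigned __int128` extension for the multiply-high, whereas a 32-bit backend would have to expand that multiply into 32-bit pieces (names here are mine, not from LLVM):

```c
#include <stdint.h>

/* n / 12 == (n / 4) / 3, and the division by 3 becomes a multiply-high:
 * with M = (2^65 + 1) / 3 = 0xAAAAAAAAAAAAAAAB, floor(n / 3) = (n * M) >> 65
 * for all n < 2^64. */
uint64_t udiv12(uint64_t n) {
    n >>= 2;  /* exact: strip the power-of-two factor of 12 first */
    return (uint64_t)(((unsigned __int128)n * 0xAAAAAAAAAAAAAAABULL) >> 65);
}
```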

@nickdesaulniers (Member, Author)

nickdesaulniers commented Jul 15, 2022

This is becoming a more pressing issue for us, particularly after 3bc09c7 as it's now going to start blocking our CI. cc @nathanchance

@nickdesaulniers (Member, Author)

> The compiler-rt routines for division are not very well optimized; we're probably talking about hundreds of cycles if we're using the 64-bit target-independent divide routine on a 32-bit target.

Specifically for 32b arm, 64b division as implemented by __aeabi_uldivmod is mostly a call to __udivmoddi4 (which is huge).
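For a sense of why that costs hundreds of cycles: the generic routine ultimately falls back to a bit-at-a-time shift-subtract loop. A simplified C sketch in the spirit of __udivmoddi4 (not the actual compiler-rt source, which has fast paths; assumes den != 0):

```c
#include <stdint.h>

/* Restoring bit-at-a-time division: up to 64 iterations, each doing a
 * shift, a compare, and a conditional subtract -- all of which must be
 * synthesized from 32-bit operations on a 32-bit target. */
uint64_t udiv64_sketch(uint64_t num, uint64_t den, uint64_t *rem) {
    uint64_t quot = 0, r = 0;
    for (int i = 63; i >= 0; --i) {
        r = (r << 1) | ((num >> i) & 1);  /* bring down the next dividend bit */
        if (r >= den) {
            r -= den;
            quot |= 1ULL << i;
        }
    }
    if (rem)
        *rem = r;
    return quot;
}
```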

@nickdesaulniers (Member, Author)

nickdesaulniers commented Jul 18, 2022

> we really don't want indvars to emit division in any case where it would be lowered to a libcall (not power of two, no legal instruction, can't use divide->multiply transform).

Any suggestion on existing machinery to check for this?

SCEVExpander::isHighCostExpansion might be a good place to do this check, but LoopUtils::rewriteLoopExitValues() will still override the results of Phi.HighCost if LoopCanBeDel...

cc @sjoerdmeijer

@nikic (Contributor)

nikic commented Jul 18, 2022

What kind of cost does the udiv get for this target? Possibly we can make LoopCanBeDel still use HighCost and just increase the budget by a lot? The normal budget is 4.

@preames (Collaborator)

preames commented Jul 18, 2022

Nick and I chatted a bit offline, here is my summary.

The motivation here is that the kernel essentially has a dialect of C which disallows 64-bit divides on 32-bit platforms. There is a kernel function which provides a 64-bit divide when needed, but the kernel expects the compiler not to insert calls to runtime routines for this purpose. Nick has a patch which implemented the runtime function in terms of the kernel function; something about that didn't work, and I didn't entirely grasp what.

RLEV is a canonicalization. Turning it off impacts other transformations, e.g. LoopVectorizer. We need to be careful to make sure that any negative impact is restricted to kernel code, and not generic userland code for the platform.

This concern makes it hard to use the cost model approach. We can introduce a high cost threshold for the dead loop case, but how do we arrange for the udiv to have infinite cost in kernel code only? We don't currently have a good answer to that.

I suggested two near term approaches. Both of these essentially punt on the root issue, and instead focus on simply getting rid of these particular udivs.

  1. Explore whether we have a peephole which could replace the udiv idiom on these examples. The udiv_ceiling reasoning - which is where the min/max/udiv patterns are coming from - is fairly recent, so we may simply have a missing peephole.

  2. See if we can get the relevant 32-bit backends to expand the arithmetic sequences (rather than libcall) in a generally profitable way. This may run into the same "how do we penalize only kernel code" question as above, but we might get lucky and be able to find a generally profitable subcase which covers all of the practical examples.

@nickdesaulniers (Member, Author)

nickdesaulniers commented Jul 18, 2022

> We also don't have any compiler optimizations for 64-bit divide by constant on 32-bit targets; we could probably emit inline code sequences that are a lot faster than the generic routine in many cases. (For division by 12, Arm gcc emits an inline sequence of roughly 20 instructions.)

> See if we can get the relevant 32-bit backends to expand the arithmetic sequences (rather than libcall) in a generally profitable way. This may run into the same "how do we penalize only kernel code" question as above, but we might get lucky and be able to find a generally profitable subcase which covers all of the practical examples.

Right, seems like we might be missing this.

@efriedma-quic (Collaborator)

We're specifically missing the "divide udword by dword" part, yes.

I think https://gmplib.org/~tege/division-paper.pdf is the current state of the art. See also https://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html and https://0x414b.com/2021/04/16/arm-division.html.

@nickdesaulniers (Member, Author)

Is DAGTypeLegalizer::ExpandIntRes_UDIV the right place to modify to improve division by constant?

@efriedma-quic (Collaborator)

The existing code for handling unsigned divide by constant is in TargetLowering::BuildUDIV.

@topperc (Collaborator)

topperc commented Aug 1, 2022

I started implementing something like what gcc does here: https://reviews.llvm.org/D130862

@nickdesaulniers (Member, Author)

From a discussion with @topperc on discourse:

> do you know what all constants linux needs so I can prioritize? I see the PR uses 12

- ClangBuiltLinux/linux#1635 is urem/udiv on arm32 and i386 of 64b with constant 12.
- ClangBuiltLinux/linux#1679 is urem/udiv on ppc32 of 64b with constant 5.
- ClangBuiltLinux/linux#1666 is udiv on arm32 of 64b with a non-constant/non-immediate.

@nikic (Contributor)

nikic commented Mar 9, 2023

I think the backend improvements for division by constants have landed -- can this issue be closed now?
