
unprofitable SCEV/INDVARS transform #56153

Open
nickdesaulniers opened this issue Jun 21, 2022 · 17 comments
@nickdesaulniers (Member)

```llvm
target triple = "armv4t-unknown-linux-gnueabi"

@scpart_scan_partmap_offs = external global i64

define void @scpart_scan_partmap() {
entry:
  br label %while.body.preheader

while.body.preheader:                             ; preds = %entry
  br label %while.body

while.body:                                       ; preds = %while.body, %while.body.preheader
  %add13 = phi i64 [ %add, %while.body ], [ undef, %while.body.preheader ]
  %add = add nsw i64 %add13, 12
  %cmp = icmp slt i64 %add13, 0
  br i1 %cmp, label %while.body, label %while.cond.while.end_crit_edge

while.cond.while.end_crit_edge:                   ; preds = %while.body
  %add.lcssa = phi i64 [ %add13, %while.body ]
  store i64 %add.lcssa, ptr null, align 8
  ret void
}
```

```
$ opt -passes=indvars reduced.ll -S
```

produces

```llvm
source_filename = "reduced.ll"
target triple = "armv4t-unknown-linux-gnueabi"

@scpart_scan_partmap_offs = external global i64

define void @scpart_scan_partmap() {
entry:
  br label %while.body.preheader

while.body.preheader:                             ; preds = %entry
  %smax = call i64 @llvm.smax.i64(i64 undef, i64 0)
  %0 = sub i64 %smax, undef
  %umin = call i64 @llvm.umin.i64(i64 %0, i64 1)
  %1 = sub i64 %0, %umin
  %2 = udiv i64 %1, 12
  %3 = add i64 %umin, %2
  %4 = mul i64 %3, 12
  br label %while.body

while.body:                                       ; preds = %while.body, %while.body.preheader
  br i1 false, label %while.body, label %while.cond.while.end_crit_edge

while.cond.while.end_crit_edge:                   ; preds = %while.body
  %5 = add i64 %4, undef
  store i64 %5, ptr null, align 8
  ret void
}

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare i64 @llvm.smax.i64(i64, i64) #0

; Function Attrs: nocallback nofree nosync nounwind readnone speculatable willreturn
declare i64 @llvm.umin.i64(i64, i64) #0

attributes #0 = { nocallback nofree nosync nounwind readnone speculatable willreturn }
```

Unfortunately, the udiv we got from SCEV is going to be lowered to a relatively expensive libcall to __aeabi_uldivmod (32-bit x86 is also affected; there the libcall is __udivdi3). So while we may have eliminated the loop body, we replaced it with a relatively expensive libcall.
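As an aside, the smax/umin/udiv sequence in the preheader is SCEV's overflow-safe ceiling-division idiom (this is what getUDivCeilSCEV in the backtrace below produces). A standalone C sketch of the identity it uses — function and variable names here are mine, not LLVM's:

```c
#include <stdint.h>

/* Overflow-safe ceiling division, mirroring the expanded preheader:
 *   umin = min(a, 1);  ceil(a / b) == umin + (a - umin) / b   (for b > 0)
 * This avoids computing a + (b - 1), which could wrap for large a. */
uint64_t udiv_ceil(uint64_t a, uint64_t b) {
    uint64_t umin = a < 1 ? a : 1;  /* 0 when a == 0, else 1 */
    return umin + (a - umin) / b;
}
```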

Forked from ClangBuiltLinux/linux#1635. ClangBuiltLinux/linux#1635 (comment) has the reduced C code, and demonstrates GCC keeping the tight loop:

```
      24: 04 f0 9d a4   ldrge   pc, [sp], #4
      28: 0c 30 93 e2   adds    r3, r3, #12
      2c: 00 20 a2 e2   adc     r2, r2, #0
      30: 01 00 53 e1   cmp     r3, r1
      34: 00 c0 d2 e0   sbcs    r12, r2, r0
      38: fa ff ff ba   blt     0x28 <scpart_scan_partmap+0x28> @ imm = #-24
```

llvm::rewriteLoopExitValues in llvm/lib/Transforms/Utils/LoopUtils.cpp tries to replace the loop body regardless of cost if the loop can be deleted. I'm not sure whether we should exclude such transforms when the SCEV contains a udiv wider than the target word size, or when the target doesn't support hardware division (cf. ARMSubtarget::hasDivideInARMMode()), or whether something more generic is needed.

See also https://reviews.llvm.org/D9800 / e2538b5.

cc @LebedevRI @fhahn @sjoerdmeijer @wmi-11 @preames @atrick

@llvmbot (Member)

llvmbot commented Jun 21, 2022

@llvm/issue-subscribers-backend-arm

@llvmbot (Member)

llvmbot commented Jun 21, 2022

@llvm/issue-subscribers-backend-x86

@nickdesaulniers (Member, Author)

FWIW, the backtrace looks like:

```
#0  0x000000000296c460 in llvm::ScalarEvolution::getUDivExpr(llvm::SCEV const*, llvm::SCEV const*) ()
#1  0x00000000029850ae in llvm::ScalarEvolution::getUDivCeilSCEV(llvm::SCEV const*, llvm::SCEV const*) ()
#2  0x00000000029934be in llvm::ScalarEvolution::howManyLessThans(llvm::SCEV const*, llvm::SCEV const*, llvm::Loop const*, bool, bool, bool) ()
#3  0x000000000298e4b7 in llvm::ScalarEvolution::computeExitLimitFromICmp(llvm::Loop const*, llvm::CmpInst::Predicate, llvm::SCEV const*, llvm::SCEV const*, bool, bool) ()
#4  0x000000000298e144 in llvm::ScalarEvolution::computeExitLimitFromICmp(llvm::Loop const*, llvm::ICmpInst*, bool, bool, bool) ()
#5  0x000000000298d66c in llvm::ScalarEvolution::computeExitLimitFromCondImpl(llvm::ScalarEvolution::ExitLimitCache&, llvm::Loop const*, llvm::Value*, bool, bool, bool) ()
#6  0x000000000298d283 in llvm::ScalarEvolution::computeExitLimitFromCondCached(llvm::ScalarEvolution::ExitLimitCache&, llvm::Loop const*, llvm::Value*, bool, bool, bool) ()
#7  0x000000000298ca67 in llvm::ScalarEvolution::computeExitLimit(llvm::Loop const*, llvm::BasicBlock*, bool) ()
#8  0x000000000298868e in llvm::ScalarEvolution::computeBackedgeTakenCount(llvm::Loop const*, bool) ()
#9  0x0000000002987134 in llvm::ScalarEvolution::getBackedgeTakenInfo(llvm::Loop const*) ()
#10 0x0000000002980992 in llvm::ScalarEvolution::getRangeRef(llvm::SCEV const*, llvm::ScalarEvolution::RangeSignHint) ()
#11 0x0000000002971779 in StrengthenNoWrapFlags(llvm::ScalarEvolution*, llvm::SCEVTypes, llvm::ArrayRef<llvm::SCEV const*>, llvm::SCEV::NoWrapFlags) ()
#12 0x00000000029640d0 in llvm::ScalarEvolution::getAddExpr(llvm::SmallVectorImpl<llvm::SCEV const*>&, llvm::SCEV::NoWrapFlags, unsigned int) ()
#13 0x0000000002976af9 in llvm::ScalarEvolution::createSCEV(llvm::Value*) ()
#14 0x000000000296ecd9 in llvm::ScalarEvolution::getSCEV(llvm::Value*) ()
#15 0x0000000003b1d1a1 in llvm::simplifyUsersOfIV(llvm::PHINode*, llvm::ScalarEvolution*, llvm::DominatorTree*, llvm::LoopInfo*, llvm::TargetTransformInfo const*, llvm::SmallVectorImpl<llvm::WeakTrackingVH>&, llvm::SCEVExpander&, llvm::IVVisitor*) ()
#16 0x00000000036cff3d in (anonymous namespace)::IndVarSimplify::run(llvm::Loop*) ()
```

so llvm::ScalarEvolution::howManyLessThans can end up inserting divisions.

@preames (Collaborator)

preames commented Jun 21, 2022

Having a backedge expression result in a udiv is pretty fundamental. That's the core goal of SCEV's design.

We could potentially explore making RLEV cost-based, but we've talked about that before, and we have pretty solid reasons for wanting to treat it as a canonical form.

On a target which doesn't have a hardware udiv, can you give a feeling for the relative cost of the libcall and the loop? Are we talking small constant-factor differences or something much more extreme? I'm basically wondering whether this is a costing question, or a legality one in disguise.

@efriedma-quic (Collaborator)

The compiler-rt routines for division are not very well optimized; we're probably talking about hundreds of cycles if we're using the 64-bit target-independent divide routine on a 32-bit target. We could probably do better, but still pretty expensive in any case. For x86 specifically, we do have an optimized assembly routine for __udivdi3, though.

We also don't have any compiler optimizations for 64-bit divide by constant on 32-bit targets; we could probably emit inline code sequences that are a lot faster than the generic routine in many cases. (For division by 12, Arm gcc emits an inline sequence of roughly 20 instructions.)

But given the code sequences the compiler currently emits, we really don't want indvars to emit division in any case where it would be lowered to a libcall (not power of two, no legal instruction, can't use divide->multiply transform).
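For reference, here is what the divide-to-multiply transform mentioned above looks like for 64-bit division by 12. This is a hedged C sketch of the principle: it uses the GCC/Clang `unsigned __int128` extension for the multiply-high, whereas a 32-bit backend would have to expand that multiply into 32-bit pieces (names here are mine, not from LLVM):

```c
#include <stdint.h>

/* n / 12 == (n / 4) / 3, and the division by 3 becomes a multiply-high:
 * with M = (2^65 + 1) / 3 = 0xAAAAAAAAAAAAAAAB, floor(n / 3) = (n * M) >> 65
 * for all n < 2^64. */
uint64_t udiv12(uint64_t n) {
    n >>= 2;  /* exact: strip the power-of-two factor of 12 first */
    return (uint64_t)(((unsigned __int128)n * 0xAAAAAAAAAAAAAAABULL) >> 65);
}
```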

@nickdesaulniers (Member, Author)

nickdesaulniers commented Jul 15, 2022

This is becoming a more pressing issue for us, particularly after 3bc09c7 as it's now going to start blocking our CI. cc @nathanchance

@nickdesaulniers (Member, Author)

> The compiler-rt routines for division are not very well optimized; we're probably talking about hundreds of cycles if we're using the 64-bit target-independent divide routine on a 32-bit target.

Specifically for 32b arm, 64b division as implemented by __aeabi_uldivmod is mostly a call to __udivmoddi4 (which is huge).
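For a sense of why that costs hundreds of cycles: the generic routine ultimately falls back to a bit-at-a-time shift-subtract loop. A simplified C sketch in the spirit of __udivmoddi4 (not the actual compiler-rt source, which has fast paths; assumes den != 0):

```c
#include <stdint.h>

/* Restoring bit-at-a-time division: up to 64 iterations, each doing a
 * shift, a compare, and a conditional subtract -- all of which must be
 * synthesized from 32-bit operations on a 32-bit target. */
uint64_t udiv64_sketch(uint64_t num, uint64_t den, uint64_t *rem) {
    uint64_t quot = 0, r = 0;
    for (int i = 63; i >= 0; --i) {
        r = (r << 1) | ((num >> i) & 1);  /* bring down the next dividend bit */
        if (r >= den) {
            r -= den;
            quot |= 1ULL << i;
        }
    }
    if (rem)
        *rem = r;
    return quot;
}
```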

@nickdesaulniers (Member, Author)

nickdesaulniers commented Jul 18, 2022

> we really don't want indvars to emit division in any case where it would be lowered to a libcall (not power of two, no legal instruction, can't use divide->multiply transform).

Any suggestion on existing machinery to check for this?

SCEVExpander::isHighCostExpansion might be a good place to do this check, but LoopUtils::rewriteLoopExitValues() will still override the results of Phi.HighCost if LoopCanBeDel...

cc @sjoerdmeijer

@nikic (Contributor)

nikic commented Jul 18, 2022

What kind of cost does the udiv get for this target? Possibly we can make LoopCanBeDel still use HighCost and just increase the budget by a lot? The normal budget is 4.

@preames (Collaborator)

preames commented Jul 18, 2022

Nick and I chatted a bit offline, here is my summary.

The motivation here is that the kernel essentially has a dialect of C which disallows 64-bit divides on 32-bit platforms. There is a kernel function which provides a 64-bit divide when needed, but the kernel expects the compiler not to insert calls to runtime routines for this purpose. Nick has a patch which implemented the runtime function in terms of the kernel function; something about that didn't work, and I didn't entirely grasp what.

RLEV is a canonicalization. Turning it off impacts other transformations, e.g. LoopVectorizer. We need to be careful to make sure that any negative impact is restricted to kernel code, and not generic userland code for the platform.

This concern makes it hard to use the cost model approach. We can introduce a high cost threshold for the dead loop case, but how do we arrange for the udiv to have infinite cost in kernel code only? We don't currently have a good answer to that.

I suggested two near term approaches. Both of these essentially punt on the root issue, and instead focus on simply getting rid of these particular udivs.

  1. Explore whether we have a peephole which could replace the udiv idiom on these examples. The udiv_ceiling reasoning - which is where the min/max/udiv patterns are coming from - is fairly recent, so we may simply have a missing peephole.

  2. See if we can get the relevant 32-bit backends to expand the arithmetic sequences (rather than libcall) in a generally profitable way. This may run into the same "how do we penalize only kernel code" question as above, but we might get lucky and be able to find a generally profitable subcase which covers all of the practical examples.

@nickdesaulniers (Member, Author)

nickdesaulniers commented Jul 18, 2022

> We also don't have any compiler optimizations for 64-bit divide by constant on 32-bit targets; we could probably emit inline code sequences that are a lot faster than the generic routine in many cases. (For division by 12, Arm gcc emits an inline sequence of roughly 20 instructions.)

> See if we can get the relevant 32-bit backends to expand the arithmetic sequences (rather than libcall) in a generally profitable way. This may run into the same "how do we penalize only kernel code" question as above, but we might get lucky and be able to find a generally profitable subcase which covers all of the practical examples.

Right, seems like we might be missing this.

@efriedma-quic (Collaborator)

We're specifically missing the "divide udword by dword" part, yes.

I think https://gmplib.org/~tege/division-paper.pdf is the current state of the art. See also https://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html and https://0x414b.com/2021/04/16/arm-division.html.

@nickdesaulniers (Member, Author)

Is DAGTypeLegalizer::ExpandIntRes_UDIV the right place to modify to improve division by constant?

@efriedma-quic (Collaborator)

The existing code for handling unsigned divide by constant is in TargetLowering::BuildUDIV.

@topperc (Collaborator)

topperc commented Aug 1, 2022

I started implementing something like what gcc does here: https://reviews.llvm.org/D130862

@nickdesaulniers (Member, Author)

From a discussion with @topperc on discourse:

> do you know what all constants linux needs so I can prioritize? I see the PR uses 12

- ClangBuiltLinux/linux#1635 is urem/udiv on arm32 and i386 of 64b with constant 12.
- ClangBuiltLinux/linux#1679 is urem/udiv on ppc32 of 64b with constant 5.
- ClangBuiltLinux/linux#1666 is udiv on arm32 of 64b with a non-constant/non-immediate.

@nikic (Contributor)

nikic commented Mar 9, 2023

I think the backend improvements for division by constants have landed -- can this issue be closed now?
