
force inline small double/int methods, especially if parameters require boxing #40852

Open
mraleph opened this issue Mar 2, 2020 · 5 comments

@mraleph
Member

mraleph commented Mar 2, 2020

When looking at native code produced for scaleRadii from Flutter dart:ui, I see us emitting code like this:

      v3 <- Constant(#1.0) T{_Double}
...
      v420 <- Constant(#0) [0, 0] T{_Smi}
...
66: B155[target]:116
 68:     PushArgument(v3)
 70:     PushArgument(v420)
 72:     v400 <- StaticCall:126( ==<0> v3, v420, recognized_kind = Double_equal, result_type = T{bool}) T{bool}
 73:     ParallelMove r0 <- r0
 74:     Branch if StrictCompare:130(===, v400, v38) goto (156, 158)

and

78:     v603 <- Box(v220) T{_Double}
 80:     PushArgument(v603)
 82:     v404 <- StaticCall:134( get:isNegative<0> v603, recognized_kind = Double_getIsNegative, result_type = T{bool}) T{bool}
 83:     ParallelMove r0 <- r0
 84:     Branch if StrictCompare:138(===, v404, v38) goto (161, 159)

This generates more code than we would get by ensuring that the isNegative and == methods are inlined.

We should investigate why and how often this happens and make sure these methods get inlined.
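For illustration, something as small as a comparison against a constant plus an isNegative check is enough to produce calls of this shape. This is a hypothetical reduction, not the actual scaleRadii source:

// Hypothetical reduction (not the real dart:ui code): without inlining,
// the `==` comparison and the `isNegative` getter below become StaticCalls
// (recognized as Double_equal / Double_getIsNegative) with boxed arguments
// instead of a few inline instructions.
double scaleOrZero(double scale, double value) {
  if (scale == 1.0) {
    return value;
  }
  if (scale.isNegative) {
    return 0.0;
  }
  return value * scale;
}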

/cc @mkustermann @alexmarkov @sjindel-google

@mraleph mraleph added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. type-performance Issue relates to performance or code size vm-aot-code-size Related to improvements in AOT code size labels Mar 2, 2020
@mraleph
Member Author

mraleph commented Mar 25, 2020

@sstrickl I think this might be a good place to start your inlining investigation project.

@sstrickl
Contributor

Looking at the example methods you mentioned: they're both intrinsics, and _Double.== is marked as never-inline. Currently our inliner bails out on intrinsics unless they're marked as always-inline, and even if I added @pragma('vm:prefer-inline') to the isNegative getter on _Double, there's an InlineBailout in the NativeCall handling in kernel_to_il.cc that would keep it from being inlined (and the same would hold if I changed never-inline to prefer-inline on _Double.==).

I'm now investigating the reasons behind all of these decisions, to see whether any of them can be loosened in a way that would let us inline these and similarly gatekept functions. I'd also appreciate any historical knowledge you and other long-time VM developers have about these decisions. (/cc @mkustermann )
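For reference, this is how the inlining pragmas mentioned above are spelled; a standalone illustration only, since the real candidates are methods on _Double inside the SDK:

// Illustrative only: the VM inlining pragmas discussed above, applied to
// top-level functions (see runtime/docs/pragmas.md for the full list).
@pragma('vm:prefer-inline')
bool isNegativeOf(double x) => x.isNegative;

@pragma('vm:never-inline')
bool equalsOne(double x) => x == 1.0;

void main() {
  print(isNegativeOf(-2.0)); // true
  print(equalsOne(1.0)); // true
}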

@mraleph
Member Author

mraleph commented Apr 17, 2020

Yeah, I think this is another example of poor architecture in our compilation pipeline:

We have intrinsified functions which are never inlined by the compiler, even though inlining them would give us smaller code. We often work around this by providing hand-written IL graphs in the call specializer, the inliner, or the graph builder (which sometimes leads to duplicated code).

I think it would be good to have some uniformity here: a single piece of code that defines the IL graph and can be used both to produce a normal function body (e.g. to serve as the intrinsic) and to be inlined or used by the call specializer.

@mraleph mraleph added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. and removed area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. labels May 6, 2020
@mraleph
Member Author

mraleph commented May 10, 2022

We should definitely review possible approaches here. Currently we are unable to unbox method receivers, which causes code quality issues for methods on double and int. When I look at the graph of BoxConstraints.enforce, for example, I see the following:

Graph of package:flutter/src/rendering/box.dart_BoxConstraints_enforce
After AllocateRegisters
==== package:flutter/src/rendering/box.dart_BoxConstraints_enforce (RegularFunction)
  0: B0[graph]:0 {
}
  2: B1[function entry]:2 {
      v2 <- Parameter(0) T{BoxConstraints}
      v3 <- Parameter(1) T{BoxConstraints}
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  5:     ParallelMove r0 <- S+3
  6:     v6 <- LoadField(v2 . minWidth {final}) T{_Double}
  7:     ParallelMove r1 <- S+2
  8:     v8 <- LoadField(v3 . minWidth {final}) T{_Double}
 10:     v10 <- LoadField(v3 . maxWidth {final}) T{_Double}
 12:     v58 <- Box(v6) T{_Double}
 14:     v60 <- Box(v8) T{_Double}
 15:     ParallelMove S-2 <- r3
 16:     v62 <- Box(v10) T{_Double}
 17:     ParallelMove S-1 <- r4
 18:     PushArgument(v58)
 20:     PushArgument(v60)
 22:     PushArgument(v62)
 24:     v12 <- StaticCall:18( clamp<0> v58, v60, v62, using unchecked entrypoint, result_type = T{_Double}) T{_Double}
 25:     ParallelMove r1 <- r0, r0 <- S+3
 26:     ParallelMove S-3 <- r1
 26:     v14 <- LoadField(v2 . maxWidth {final}) T{_Double}
 28:     v66 <- Box(v14) T{_Double}
 30:     PushArgument(v66)
 32:     PushArgument(v60)
 34:     PushArgument(v62)
 36:     v20 <- StaticCall:26( clamp<0> v66, v60, v62, using unchecked entrypoint, result_type = T{_Double}) T{_Double}
 37:     ParallelMove r1 <- r0, r0 <- S+3
 38:     ParallelMove S-4 <- r1
 38:     v22 <- LoadField(v2 . minHeight {final}) T{_Double}
 39:     ParallelMove r2 <- S+2
 40:     v24 <- LoadField(v3 . minHeight {final}) T{_Double}
 42:     v26 <- LoadField(v3 . maxHeight {final}) T{_Double}
 44:     v74 <- Box(v22) T{_Double}
 46:     v76 <- Box(v24) T{_Double}
 47:     ParallelMove S-2 <- r3
 48:     v78 <- Box(v26) T{_Double}
 49:     ParallelMove S-1 <- r4
 50:     PushArgument(v74)
 52:     PushArgument(v76)
 54:     PushArgument(v78)
 56:     v28 <- StaticCall:34( clamp<0> v74, v76, v78, using unchecked entrypoint, result_type = T{_Double}) T{_Double}
 57:     ParallelMove r1 <- r0, r0 <- S+3
 58:     ParallelMove S-5 <- r1
 58:     v30 <- LoadField(v2 . maxHeight {final}) T{_Double}
 60:     v82 <- Box(v30) T{_Double}
 62:     PushArgument(v82)
 64:     PushArgument(v76)
 66:     PushArgument(v78)
 68:     v36 <- StaticCall:42( clamp<0> v82, v76, v78, using unchecked entrypoint, result_type = T{_Double}) T{_Double}
 69:     ParallelMove r1 <- r0, r0 <- S-3
 70:     ParallelMove S-1 <- r1
 70:     v64 <- Unbox(v12) T{_Double}
 71:     ParallelMove DS-6 <- v0
 72:     v4 <- AllocateObject:10(cls=BoxConstraints, <not-aliased>) T{BoxConstraints}
 73:     ParallelMove r0 <- r0, v0 <- DS-6
 74:     StoreInstanceField(v4 . minWidth = v64, NoStoreBarrier)
 75:     ParallelMove r1 <- S-4
 76:     v72 <- Unbox(v20) T{_Double}
 78:     StoreInstanceField(v4 . maxWidth = v72, NoStoreBarrier)
 79:     ParallelMove r1 <- S-5
 80:     v80 <- Unbox(v28) T{_Double}
 82:     StoreInstanceField(v4 . minHeight = v80, NoStoreBarrier)
 83:     ParallelMove r1 <- S-1
 84:     v88 <- Unbox(v36) T{_Double}
 86:     StoreInstanceField(v4 . maxHeight = v88, NoStoreBarrier)
 87:     ParallelMove r0 <- r0
 88:     Return:46(v4)
*** END CFG

Note that all BoxConstraints fields are unboxed, which causes them to be reboxed just to be passed down to the statically invoked double.clamp. This is a lot of overhead (the implementation of double.clamp itself is potentially slow, but that's a separate issue: #46879).
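For context, the Dart source being compiled is roughly the following (paraphrased and trimmed from package:flutter/src/rendering/box.dart; the exact code may differ):

// Paraphrased sketch of BoxConstraints.enforce: every field read produces an
// unboxed double that is then reboxed just to be passed to double.clamp.
class BoxConstraints {
  const BoxConstraints({
    this.minWidth = 0.0,
    this.maxWidth = double.infinity,
    this.minHeight = 0.0,
    this.maxHeight = double.infinity,
  });

  final double minWidth;
  final double maxWidth;
  final double minHeight;
  final double maxHeight;

  BoxConstraints enforce(BoxConstraints constraints) {
    return BoxConstraints(
      minWidth: minWidth.clamp(constraints.minWidth, constraints.maxWidth),
      maxWidth: maxWidth.clamp(constraints.minWidth, constraints.maxWidth),
      minHeight: minHeight.clamp(constraints.minHeight, constraints.maxHeight),
      maxHeight: maxHeight.clamp(constraints.minHeight, constraints.maxHeight),
    );
  }
}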

I have written a microbenchmark to try to establish the cost; the boxing probably contributes around 100 ps per invocation (measured on a slower Android ARM32 device):

I/flutter (18964): double.clamp: 318.80 ps
I/flutter (18964): naive clamp: 11.55 ps
I/flutter (18964): naive clamp (arg check): 20.36 ps
I/flutter (18964): naive clamp (boxed): 98.43 ps
I/flutter (18964): naive clamp (simd): 3.32 ps
I/flutter (18964): naive clamp (simd array): 6.13 ps

On an ARM64 device (faster, but downscaled to 1132800):

I/flutter (12151): double.clamp: 197.60 ps
I/flutter (12151): naive clamp: 7.38 ps
I/flutter (12151): naive clamp (arg check): 13.85 ps
I/flutter (12151): naive clamp (boxed): 94.33 ps
I/flutter (12151): naive clamp (simd): 3.28 ps
I/flutter (12151): naive clamp (simd array): 3.92 ps

We might want to apply some sort of worker/wrapper approach here, e.g. split methods on double and int into unboxed workers which do the actual work and wrappers which handle the boxing/unboxing, whenever possible.
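A minimal sketch of that idea, using standalone functions with made-up names (the real split would live on _Double inside the SDK, and this naive worker skips the NaN handling and argument checks that double.clamp performs):

// Hypothetical worker/wrapper split for double.clamp.
// Wrapper: keeps the boxed, num-typed interface and pays the conversion
// cost once at the call boundary.
double clampWrapper(double value, num lowerLimit, num upperLimit) =>
    _clampWorker(value, lowerLimit.toDouble(), upperLimit.toDouble());

// Worker: straight-line code on unboxed doubles, small enough to inline.
// Unlike double.clamp, it ignores NaN and does no argument validation.
@pragma('vm:prefer-inline')
double _clampWorker(double value, double lo, double hi) =>
    value < lo ? lo : (value > hi ? hi : value);

void main() {
  print(clampWrapper(1.5, 0.0, 1.0)); // 1.0
}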

Of course the original message of this issue still stands: all small functions need to be inlined.

@gaaclarke
Contributor

FYI I moved Flutter to its own implementation of clamp: flutter/flutter#103559
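For reference, the replacement is a plain top-level function along these lines (paraphrased; see flutter/flutter#103559 for the exact code):

// Roughly what Flutter's clampDouble does: straight-line code on unboxed
// doubles that the compiler can inline, sidestepping double.clamp entirely.
double clampDouble(double x, double min, double max) {
  assert(min <= max && !max.isNaN && !min.isNaN);
  if (x < min) {
    return min;
  }
  if (x > max) {
    return max;
  }
  if (x.isNaN) {
    return max;
  }
  return x;
}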
