
force inline small double/int methods, especially if parameters require boxing #40852

Open
mraleph opened this issue Mar 2, 2020 · 5 comments

@mraleph
Member

mraleph commented Mar 2, 2020

When looking at native code produced for scaleRadii from Flutter dart:ui, I see us emitting code like this:

      v3 <- Constant(#1.0) T{_Double}
...
      v420 <- Constant(#0) [0, 0] T{_Smi}
...
66: B155[target]:116
 68:     PushArgument(v3)
 70:     PushArgument(v420)
 72:     v400 <- StaticCall:126( ==<0> v3, v420, recognized_kind = Double_equal, result_type = T{bool}) T{bool}
 73:     ParallelMove r0 <- r0
 74:     Branch if StrictCompare:130(===, v400, v38) goto (156, 158)

and

78:     v603 <- Box(v220) T{_Double}
 80:     PushArgument(v603)
 82:     v404 <- StaticCall:134( get:isNegative<0> v603, recognized_kind = Double_getIsNegative, result_type = T{bool}) T{bool}
 83:     ParallelMove r0 <- r0
 84:     Branch if StrictCompare:138(===, v404, v38) goto (161, 159)

This generates more code than we would get by ensuring that the isNegative and == methods are inlined.

We should investigate why and how often this happens and make sure these methods get inlined.
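For illustration, something as small as a comparison against a constant plus an isNegative check is enough to produce calls of this shape. This is a hypothetical reduction, not the actual scaleRadii source:

// Hypothetical reduction (not the real dart:ui code): without inlining,
// the `==` comparison and the `isNegative` getter below become StaticCalls
// (recognized as Double_equal / Double_getIsNegative) with boxed arguments
// instead of a few inline instructions.
double scaleOrZero(double scale, double value) {
  if (scale == 1.0) {
    return value;
  }
  if (scale.isNegative) {
    return 0.0;
  }
  return value * scale;
}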

/cc @mkustermann @alexmarkov @sjindel-google

@mraleph mraleph added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. type-performance Issue relates to performance or code size vm-aot-code-size Related to improvements in AOT code size labels Mar 2, 2020
@mraleph
Member Author

mraleph commented Mar 25, 2020

@sstrickl I think this might be a good place to start your inlining investigation project.

@sstrickl
Contributor

Looking at the example methods you mentioned: they're both intrinsics, and _Double.== is marked as never-inline. Currently our inliner bails out on intrinsics unless they're marked as always-inline, and even if I added @pragma('vm:prefer-inline') to the isNegative getter on _Double, there's an InlineBailout in the NativeCall handling in kernel_to_il.cc that would keep it from being inlined (and the same would hold if I changed never-inline to prefer-inline on _Double.==).

I'm now investigating the reasons behind all of these decisions, to see whether any of them can be loosened in a way that would let us inline these and similarly gatekept functions. I'd also appreciate any historical knowledge you and other long-time VM developers have about these decisions. (/cc @mkustermann )
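For reference, this is how the inlining pragmas mentioned above are spelled; a standalone illustration only, since the real candidates are methods on _Double inside the SDK:

// Illustrative only: the VM inlining pragmas discussed above, applied to
// top-level functions (see runtime/docs/pragmas.md for the full list).
@pragma('vm:prefer-inline')
bool isNegativeOf(double x) => x.isNegative;

@pragma('vm:never-inline')
bool equalsOne(double x) => x == 1.0;

void main() {
  print(isNegativeOf(-2.0)); // true
  print(equalsOne(1.0)); // true
}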

@mraleph
Member Author

mraleph commented Apr 17, 2020

Yeah, I think this is another example of poor architecture in our compilation pipeline:

We have intrinsified functions which are never inlined by the compiler, even though inlining them would give us smaller code. We often work around this by providing hand-written IL graphs in the call specializer, the inliner, or the graph builder (which sometimes leads to duplicated code).

I think it would be good to have some uniformity here: a single piece of code that defines the IL graph and can be used both to produce a normal function body (e.g. to serve as the intrinsic) and to be inlined or used by the call specializer.

@mraleph mraleph added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. and removed area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. labels May 6, 2020
@mraleph
Member Author

mraleph commented May 10, 2022

We should definitely review possible approaches here. Currently we are unable to unbox method receivers, which causes code quality issues for methods on double and int. When I look at the graph of BoxConstraints.enforce, for example, I see the following:

Graph of package:flutter/src/rendering/box.dart_BoxConstraints_enforce
After AllocateRegisters
==== package:flutter/src/rendering/box.dart_BoxConstraints_enforce (RegularFunction)
  0: B0[graph]:0 {
}
  2: B1[function entry]:2 {
      v2 <- Parameter(0) T{BoxConstraints}
      v3 <- Parameter(1) T{BoxConstraints}
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  5:     ParallelMove r0 <- S+3
  6:     v6 <- LoadField(v2 . minWidth {final}) T{_Double}
  7:     ParallelMove r1 <- S+2
  8:     v8 <- LoadField(v3 . minWidth {final}) T{_Double}
 10:     v10 <- LoadField(v3 . maxWidth {final}) T{_Double}
 12:     v58 <- Box(v6) T{_Double}
 14:     v60 <- Box(v8) T{_Double}
 15:     ParallelMove S-2 <- r3
 16:     v62 <- Box(v10) T{_Double}
 17:     ParallelMove S-1 <- r4
 18:     PushArgument(v58)
 20:     PushArgument(v60)
 22:     PushArgument(v62)
 24:     v12 <- StaticCall:18( clamp<0> v58, v60, v62, using unchecked entrypoint, result_type = T{_Double}) T{_Double}
 25:     ParallelMove r1 <- r0, r0 <- S+3
 26:     ParallelMove S-3 <- r1
 26:     v14 <- LoadField(v2 . maxWidth {final}) T{_Double}
 28:     v66 <- Box(v14) T{_Double}
 30:     PushArgument(v66)
 32:     PushArgument(v60)
 34:     PushArgument(v62)
 36:     v20 <- StaticCall:26( clamp<0> v66, v60, v62, using unchecked entrypoint, result_type = T{_Double}) T{_Double}
 37:     ParallelMove r1 <- r0, r0 <- S+3
 38:     ParallelMove S-4 <- r1
 38:     v22 <- LoadField(v2 . minHeight {final}) T{_Double}
 39:     ParallelMove r2 <- S+2
 40:     v24 <- LoadField(v3 . minHeight {final}) T{_Double}
 42:     v26 <- LoadField(v3 . maxHeight {final}) T{_Double}
 44:     v74 <- Box(v22) T{_Double}
 46:     v76 <- Box(v24) T{_Double}
 47:     ParallelMove S-2 <- r3
 48:     v78 <- Box(v26) T{_Double}
 49:     ParallelMove S-1 <- r4
 50:     PushArgument(v74)
 52:     PushArgument(v76)
 54:     PushArgument(v78)
 56:     v28 <- StaticCall:34( clamp<0> v74, v76, v78, using unchecked entrypoint, result_type = T{_Double}) T{_Double}
 57:     ParallelMove r1 <- r0, r0 <- S+3
 58:     ParallelMove S-5 <- r1
 58:     v30 <- LoadField(v2 . maxHeight {final}) T{_Double}
 60:     v82 <- Box(v30) T{_Double}
 62:     PushArgument(v82)
 64:     PushArgument(v76)
 66:     PushArgument(v78)
 68:     v36 <- StaticCall:42( clamp<0> v82, v76, v78, using unchecked entrypoint, result_type = T{_Double}) T{_Double}
 69:     ParallelMove r1 <- r0, r0 <- S-3
 70:     ParallelMove S-1 <- r1
 70:     v64 <- Unbox(v12) T{_Double}
 71:     ParallelMove DS-6 <- v0
 72:     v4 <- AllocateObject:10(cls=BoxConstraints, <not-aliased>) T{BoxConstraints}
 73:     ParallelMove r0 <- r0, v0 <- DS-6
 74:     StoreInstanceField(v4 . minWidth = v64, NoStoreBarrier)
 75:     ParallelMove r1 <- S-4
 76:     v72 <- Unbox(v20) T{_Double}
 78:     StoreInstanceField(v4 . maxWidth = v72, NoStoreBarrier)
 79:     ParallelMove r1 <- S-5
 80:     v80 <- Unbox(v28) T{_Double}
 82:     StoreInstanceField(v4 . minHeight = v80, NoStoreBarrier)
 83:     ParallelMove r1 <- S-1
 84:     v88 <- Unbox(v36) T{_Double}
 86:     StoreInstanceField(v4 . maxHeight = v88, NoStoreBarrier)
 87:     ParallelMove r0 <- r0
 88:     Return:46(v4)
*** END CFG

Note that all BoxConstraints fields are unboxed, which causes them to be reboxed just to be passed down to the statically invoked double.clamp. This is a lot of overhead (the implementation of double.clamp itself is potentially slow, but that's a separate issue: #46879).
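For context, the Dart source being compiled is roughly the following (paraphrased and trimmed from package:flutter/src/rendering/box.dart; the exact code may differ):

// Paraphrased sketch of BoxConstraints.enforce: every field read produces an
// unboxed double that is then reboxed just to be passed to double.clamp.
class BoxConstraints {
  const BoxConstraints({
    this.minWidth = 0.0,
    this.maxWidth = double.infinity,
    this.minHeight = 0.0,
    this.maxHeight = double.infinity,
  });

  final double minWidth;
  final double maxWidth;
  final double minHeight;
  final double maxHeight;

  BoxConstraints enforce(BoxConstraints constraints) {
    return BoxConstraints(
      minWidth: minWidth.clamp(constraints.minWidth, constraints.maxWidth),
      maxWidth: maxWidth.clamp(constraints.minWidth, constraints.maxWidth),
      minHeight: minHeight.clamp(constraints.minHeight, constraints.maxHeight),
      maxHeight: maxHeight.clamp(constraints.minHeight, constraints.maxHeight),
    );
  }
}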

I have written a microbenchmark to try to establish the cost; the boxing probably contributes around 100 ps per invocation (measured on a slower Android ARM32 device):

I/flutter (18964): double.clamp: 318.80 ps
I/flutter (18964): naive clamp: 11.55 ps
I/flutter (18964): naive clamp (arg check): 20.36 ps
I/flutter (18964): naive clamp (boxed): 98.43 ps
I/flutter (18964): naive clamp (simd): 3.32 ps
I/flutter (18964): naive clamp (simd array): 6.13 ps

On an ARM64 device (faster, but downscaled to 1132800):

I/flutter (12151): double.clamp: 197.60 ps
I/flutter (12151): naive clamp: 7.38 ps
I/flutter (12151): naive clamp (arg check): 13.85 ps
I/flutter (12151): naive clamp (boxed): 94.33 ps
I/flutter (12151): naive clamp (simd): 3.28 ps
I/flutter (12151): naive clamp (simd array): 3.92 ps

We might want to apply some sort of worker/wrapper approach here, e.g. split methods on double and int into unboxed workers which do the actual work and wrappers which handle the boxing/unboxing, whenever possible.
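A minimal sketch of that idea, using standalone functions with made-up names (the real split would live on _Double inside the SDK, and this naive worker skips the NaN handling and argument checks that double.clamp performs):

// Hypothetical worker/wrapper split for double.clamp.
// Wrapper: keeps the boxed, num-typed interface and pays the conversion
// cost once at the call boundary.
double clampWrapper(double value, num lowerLimit, num upperLimit) =>
    _clampWorker(value, lowerLimit.toDouble(), upperLimit.toDouble());

// Worker: straight-line code on unboxed doubles, small enough to inline.
// Unlike double.clamp, it ignores NaN and does no argument validation.
@pragma('vm:prefer-inline')
double _clampWorker(double value, double lo, double hi) =>
    value < lo ? lo : (value > hi ? hi : value);

void main() {
  print(clampWrapper(1.5, 0.0, 1.0)); // 1.0
}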

Of course the original message of this issue still stands: all small functions need to be inlined.

@gaaclarke
Contributor

FYI I moved Flutter to its own implementation of clamp: flutter/flutter#103559
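For reference, the replacement is a plain top-level function along these lines (paraphrased; see flutter/flutter#103559 for the exact code):

// Roughly what Flutter's clampDouble does: straight-line code on unboxed
// doubles that the compiler can inline, sidestepping double.clamp entirely.
double clampDouble(double x, double min, double max) {
  assert(min <= max && !max.isNaN && !min.isNaN);
  if (x < min) {
    return min;
  }
  if (x > max) {
    return max;
  }
  if (x.isNaN) {
    return max;
  }
  return x;
}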
