Add AggressiveInlining to a couple of Vector.Create methods #71889

EgorBo · 2022-07-09T20:19:44Z

Fixes #71885
These methods aren't intrinsics and when the more conservative inliner is used (for AOT in non --Ot mode, which is default) it fails to inline them.

static Vector128<byte> CreateWithROSpan() => Vector128.Create("0123456789ABCDEF"u8);

Was:

; Assembly listing for method D:CreateWithROSpan():System.Runtime.Intrinsics.Vector128`1[System.Byte]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; ReadyToRun compilation
; optimized code
G_M10759_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC30             sub      rsp, 48
       33C0                 xor      eax, eax
       4889442420           mov      qword ptr [rsp+20H], rax
       488BF1               mov      rsi, rcx
G_M10759_IG02:              ;; offset=000FH
       488D0D00000000       lea      rcx, [(reloc 0x4000000000420060)]
       48894C2420           mov      bword ptr [rsp+20H], rcx
       C744242810000000     mov      dword ptr [rsp+28H], 16
       488BCE               mov      rcx, rsi
       488D542420           lea      rdx, [rsp+20H]
       FF1500000000         call     [System.Runtime.Intrinsics.Vector128:Create(System.ReadOnlySpan`1[System.Byte]):System.Runtime.Intrinsics.Vector128`1[System.Byte]]
       488BC6               mov      rax, rsi
G_M10759_IG03:              ;; offset=0034H
       4883C430             add      rsp, 48
       5E                   pop      rsi
       C3                   ret

; Total bytes of code 58, prolog size 12, PerfScore 16.80, instruction count 15, allocated bytes for code 58 (MethodHash=107ed5f8) for method D:CreateWithROSpan():System.Runtime.Intrinsics.Vector128`1[System.Byte]
; ============================================================

Now:

; Assembly listing for method D:CreateWithROSpan():System.Runtime.Intrinsics.Vector128`1[System.Byte]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; ReadyToRun compilation
; optimized code
G_M10759_IG01:              ;; offset=0000H
       C5F877               vzeroupper
G_M10759_IG02:              ;; offset=0003H
       488D0500000000       lea      rax, [(reloc 0x4000000000420060)]
       C5F91000             vmovupd  xmm0, xmmword ptr [rax]
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx
G_M10759_IG03:              ;; offset=0015H
       C3                   ret

; Total bytes of code 22, prolog size 3, PerfScore 11.15, instruction count 6, allocated bytes for code 24 (MethodHash=107ed5f8) for method D:CreateWithROSpan():System.Runtime.Intrinsics.Vector128`1[System.Byte]
; ============================================================

cc @tannergooding

ghost · 2022-07-09T20:20:03Z

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #71885
These methods aren't intrinsics and when the more conservative inliner is used (for AOT in non --Ot mode, which is default) it fails to inline them.

static Vector128<byte> CreateWithROSpan() => Vector128.Create("0123456789ABCDEF"u8);

Was:

; Assembly listing for method D:CreateWithROSpan():System.Runtime.Intrinsics.Vector128`1[System.Byte]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; ReadyToRun compilation
; optimized code

G_M10759_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC30             sub      rsp, 48
       33C0                 xor      eax, eax
       4889442420           mov      qword ptr [rsp+20H], rax
       488BF1               mov      rsi, rcx
                                                ;; size=15 bbWeight=1    PerfScore 2.75
G_M10759_IG02:              ;; offset=000FH
       488D0D00000000       lea      rcx, [(reloc 0x4000000000420060)]
       48894C2420           mov      bword ptr [rsp+20H], rcx
       C744242810000000     mov      dword ptr [rsp+28H], 16
       488BCE               mov      rcx, rsi
       488D542420           lea      rdx, [rsp+20H]
       FF1500000000         call     [System.Runtime.Intrinsics.Vector128:Create(System.ReadOnlySpan`1[System.Byte]):System.Runtime.Intrinsics.Vector128`1[System.Byte]]
       488BC6               mov      rax, rsi
                                                ;; size=37 bbWeight=1    PerfScore 6.50
G_M10759_IG03:              ;; offset=0034H
       4883C430             add      rsp, 48
       5E                   pop      rsi
       C3                   ret
                                                ;; size=6 bbWeight=1    PerfScore 1.75

; Total bytes of code 58, prolog size 12, PerfScore 16.80, instruction count 15, allocated bytes for code 58 (MethodHash=107ed5f8) for method D:CreateWithROSpan():System.Runtime.Intrinsics.Vector128`1[System.Byte]
; ============================================================

Now:

cc @tannergooding

Author:	EgorBo
Assignees:	-
Labels:	`area-System.Runtime.Intrinsics`
Milestone:	-

am11 · 2022-07-09T20:27:02Z

Thanks for the quick fix! While at it, feel free to update HexConverter.cs#L99-L107 to use ""u8 if you want. 😉

tannergooding · 2022-07-09T20:27:10Z

We may want to look at making them intrinsic in .NET 8. The "equivalents" are intrinsic for System.Numerics.Vector<T>

EgorBo · 2022-07-09T20:33:55Z

We may want to look at making them intrinsic in .NET 8. The "equivalents" are intrinsic for System.Numerics.Vector<T>

@tannergooding Are you sure those are intrinsics for Vector<T>? I don't see it for:

private static ReadOnlySpan<byte> data => new []
{
    (byte)'0', (byte)'1', (byte)'2', (byte)'3',
    (byte)'4', (byte)'5', (byte)'6', (byte)'7',
    (byte)'8', (byte)'9', (byte)'A', (byte)'B',
    (byte)'C', (byte)'D', (byte)'E', (byte)'F',
    (byte)'0', (byte)'1', (byte)'2', (byte)'3',
    (byte)'4', (byte)'5', (byte)'6', (byte)'7',
    (byte)'8', (byte)'9', (byte)'A', (byte)'B',
    (byte)'C', (byte)'D', (byte)'E', (byte)'F',
};

static Vector<byte> CreateWithROSpan() => new Vector<byte>(data);

dump: https://gist.github.com/EgorBo/564520f9abfa8bc6da5103dc3a6606c9

Thanks for the quick fix! While at it, feel free to update HexConverter.cs#L99-L107 to use ""u8 if you want. 😉

leaving it for you 🙂

EgorBo · 2022-07-09T20:35:47Z

Ah, probably the T[] array overloads. ReadOnlySpan<T> going to be difficult to handle - it emits quite a verbose IR in JIT 🙁

am11 · 2022-07-10T00:18:26Z

for AOT in non --Ot mode, which is default

The optimization is certainly turned on for Release configuration:

runtime/src/coreclr/nativeaot/BuildIntegration/BuildFrameworkNativeObjects.proj

Line 9 in 4591820

    
           <Optimize Condition="'$(Configuration)' == 'Release' and '$(Optimize)' == ''">true</Optimize>

(-O is passed to ilc, which can be seen in VectorCreation/obj/Release/net7.0/linux-x64/native/VectorCreation.ilc.rsp from repro I posted)

EgorBo · 2022-07-10T00:31:07Z

for AOT in non --Ot mode, which is default

The optimization is certainly turned on for Release configuration:

runtime/src/coreclr/nativeaot/BuildIntegration/BuildFrameworkNativeObjects.proj

Line 9 in 4591820

<Optimize Condition="'$(Configuration)' == 'Release' and '$(Optimize)' == ''">true</Optimize>

(-O is passed to ilc, which can be seen in VectorCreation/obj/Release/net7.0/linux-x64/native/VectorCreation.ilc.rsp from repro I posted)

Yes, but there is a difference between -O and --Ot. -O is a sort of default optimization mode (aka blended) while --Ot is "prefer speed". --Ot uses a more aggressive inliner (hence, bigger binary size).

Same applies to crossgen2

am11 · 2022-07-10T00:42:29Z

Ah nice, I missed that. With -p:IlcOptimizationPreference=Speed (that maps to --Ot), it gives:

Dump of assembler code for function VectorCreation_D__CreateWithROSpan:
   0x0000000000148410 <+0>:	push   rbp
   0x0000000000148411 <+1>:	mov    rbp,rsp
   0x0000000000148414 <+4>:	lea    rax,[rip+0x1e8f15]        # 0x331330
   0x000000000014841b <+11>:	movups xmm0,XMMWORD PTR [rax]
   0x000000000014841e <+14>:	movups XMMWORD PTR [rsi],xmm0
   0x0000000000148421 <+17>:	mov    rax,rsi
   0x0000000000148424 <+20>:	pop    rbp
   0x0000000000148425 <+21>:	ret    
End of assembler dump.

but the byte, byte, byte ... x 16 overload still has a better codegen. (I initially thought that we have params byte[] overload but I was wrong; we have specialized one for 16 bytes args).

EgorBo · 2022-07-10T01:11:05Z

but the byte, byte, byte ... x 16 overload still has a better codegen.

The difference is that ROS<> is a static RVA -> it can be a single storage for vectors in different methods while the byte-byte-byte.. overload will emit the data per-method in the .data section - it's easier to access from both jit and aot - just a single load instruction with RIP-addressing. So, probably, your suggested change will always be slightly slower for AOT/CG2

dotnet-issue-labeler bot added the area-System.Runtime.Intrinsics label Jul 9, 2022

ghost assigned EgorBo Jul 9, 2022

Add AggressiveInlining to a couple of Vector.Create methods

4c6f32f

EgorBo force-pushed the add-aggr-inli branch from b8b1b01 to 4c6f32f Compare July 9, 2022 20:22

tannergooding approved these changes Jul 9, 2022

View reviewed changes

stephentoub approved these changes Jul 9, 2022

View reviewed changes

stephentoub merged commit 25216ae into dotnet:main Jul 10, 2022

EgorBo deleted the add-aggr-inli branch July 10, 2022 01:11

runfoapp bot mentioned this pull request Jul 12, 2022

system.net.security.tests work item timeout on mono #71997

Closed

kunalspathak mentioned this pull request Jul 22, 2022

[Perf] Regressions in System.Buffers.Tests.ReadOnlySequenceTests<Char> #72025

Closed

ghost locked as resolved and limited conversation to collaborators Aug 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add AggressiveInlining to a couple of Vector.Create methods #71889

Add AggressiveInlining to a couple of Vector.Create methods #71889

EgorBo commented Jul 9, 2022 •

edited

Loading

ghost commented Jul 9, 2022

am11 commented Jul 9, 2022

tannergooding commented Jul 9, 2022

EgorBo commented Jul 9, 2022 •

edited

Loading

EgorBo commented Jul 9, 2022

am11 commented Jul 10, 2022

EgorBo commented Jul 10, 2022 •

edited

Loading

am11 commented Jul 10, 2022 •

edited

Loading

EgorBo commented Jul 10, 2022

Add AggressiveInlining to a couple of Vector.Create methods #71889

Add AggressiveInlining to a couple of Vector.Create methods #71889

Conversation

EgorBo commented Jul 9, 2022 • edited Loading

ghost commented Jul 9, 2022

am11 commented Jul 9, 2022

tannergooding commented Jul 9, 2022

EgorBo commented Jul 9, 2022 • edited Loading

EgorBo commented Jul 9, 2022

am11 commented Jul 10, 2022

EgorBo commented Jul 10, 2022 • edited Loading

am11 commented Jul 10, 2022 • edited Loading

EgorBo commented Jul 10, 2022

EgorBo commented Jul 9, 2022 •

edited

Loading

EgorBo commented Jul 9, 2022 •

edited

Loading

EgorBo commented Jul 10, 2022 •

edited

Loading

am11 commented Jul 10, 2022 •

edited

Loading