Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AggressiveInlining to a couple of Vector.Create methods #71889

Merged
merged 1 commit into from
Jul 10, 2022

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Jul 9, 2022

Fixes #71885
These methods aren't intrinsics and when the more conservative inliner is used (for AOT in non --Ot mode, which is default) it fails to inline them.

static Vector128<byte> CreateWithROSpan() => Vector128.Create("0123456789ABCDEF"u8);

Was:

; Assembly listing for method D:CreateWithROSpan():System.Runtime.Intrinsics.Vector128`1[System.Byte]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; ReadyToRun compilation
; optimized code
G_M10759_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC30             sub      rsp, 48
       33C0                 xor      eax, eax
       4889442420           mov      qword ptr [rsp+20H], rax
       488BF1               mov      rsi, rcx
G_M10759_IG02:              ;; offset=000FH
       488D0D00000000       lea      rcx, [(reloc 0x4000000000420060)]
       48894C2420           mov      bword ptr [rsp+20H], rcx
       C744242810000000     mov      dword ptr [rsp+28H], 16
       488BCE               mov      rcx, rsi
       488D542420           lea      rdx, [rsp+20H]
       FF1500000000         call     [System.Runtime.Intrinsics.Vector128:Create(System.ReadOnlySpan`1[System.Byte]):System.Runtime.Intrinsics.Vector128`1[System.Byte]]
       488BC6               mov      rax, rsi
G_M10759_IG03:              ;; offset=0034H
       4883C430             add      rsp, 48
       5E                   pop      rsi
       C3                   ret

; Total bytes of code 58, prolog size 12, PerfScore 16.80, instruction count 15, allocated bytes for code 58 (MethodHash=107ed5f8) for method D:CreateWithROSpan():System.Runtime.Intrinsics.Vector128`1[System.Byte]
; ============================================================

Now:

; Assembly listing for method D:CreateWithROSpan():System.Runtime.Intrinsics.Vector128`1[System.Byte]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; ReadyToRun compilation
; optimized code
G_M10759_IG01:              ;; offset=0000H
       C5F877               vzeroupper
G_M10759_IG02:              ;; offset=0003H
       488D0500000000       lea      rax, [(reloc 0x4000000000420060)]
       C5F91000             vmovupd  xmm0, xmmword ptr [rax]
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx
G_M10759_IG03:              ;; offset=0015H
       C3                   ret

; Total bytes of code 22, prolog size 3, PerfScore 11.15, instruction count 6, allocated bytes for code 24 (MethodHash=107ed5f8) for method D:CreateWithROSpan():System.Runtime.Intrinsics.Vector128`1[System.Byte]
; ============================================================

cc @tannergooding

@ghost
Copy link

ghost commented Jul 9, 2022

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #71885
These methods aren't intrinsics and when the more conservative inliner is used (for AOT in non --Ot mode, which is default) it fails to inline them.

static Vector128<byte> CreateWithROSpan() => Vector128.Create("0123456789ABCDEF"u8);

Was:

; Assembly listing for method D:CreateWithROSpan():System.Runtime.Intrinsics.Vector128`1[System.Byte]
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; ReadyToRun compilation
; optimized code

G_M10759_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC30             sub      rsp, 48
       33C0                 xor      eax, eax
       4889442420           mov      qword ptr [rsp+20H], rax
       488BF1               mov      rsi, rcx
                                                ;; size=15 bbWeight=1    PerfScore 2.75
G_M10759_IG02:              ;; offset=000FH
       488D0D00000000       lea      rcx, [(reloc 0x4000000000420060)]
       48894C2420           mov      bword ptr [rsp+20H], rcx
       C744242810000000     mov      dword ptr [rsp+28H], 16
       488BCE               mov      rcx, rsi
       488D542420           lea      rdx, [rsp+20H]
       FF1500000000         call     [System.Runtime.Intrinsics.Vector128:Create(System.ReadOnlySpan`1[System.Byte]):System.Runtime.Intrinsics.Vector128`1[System.Byte]]
       488BC6               mov      rax, rsi
                                                ;; size=37 bbWeight=1    PerfScore 6.50
G_M10759_IG03:              ;; offset=0034H
       4883C430             add      rsp, 48
       5E                   pop      rsi
       C3                   ret
                                                ;; size=6 bbWeight=1    PerfScore 1.75

; Total bytes of code 58, prolog size 12, PerfScore 16.80, instruction count 15, allocated bytes for code 58 (MethodHash=107ed5f8) for method D:CreateWithROSpan():System.Runtime.Intrinsics.Vector128`1[System.Byte]
; ============================================================

Now:

cc @tannergooding

Author: EgorBo
Assignees: -
Labels:

area-System.Runtime.Intrinsics

Milestone: -

@am11
Copy link
Member

am11 commented Jul 9, 2022

Thanks for the quick fix! While at it, feel free to update HexConverter.cs#L99-L107 to use ""u8 if you want. 😉

@tannergooding
Copy link
Member

We may want to look at making them intrinsic in .NET 8. The "equivalents" are intrinsic for System.Numerics.Vector<T>

@EgorBo
Copy link
Member Author

EgorBo commented Jul 9, 2022

We may want to look at making them intrinsic in .NET 8. The "equivalents" are intrinsic for System.Numerics.Vector<T>

@tannergooding Are you sure those are intrinsics for Vector<T>? I don't see it for:

private static ReadOnlySpan<byte> data => new []
{
    (byte)'0', (byte)'1', (byte)'2', (byte)'3',
    (byte)'4', (byte)'5', (byte)'6', (byte)'7',
    (byte)'8', (byte)'9', (byte)'A', (byte)'B',
    (byte)'C', (byte)'D', (byte)'E', (byte)'F',
    (byte)'0', (byte)'1', (byte)'2', (byte)'3',
    (byte)'4', (byte)'5', (byte)'6', (byte)'7',
    (byte)'8', (byte)'9', (byte)'A', (byte)'B',
    (byte)'C', (byte)'D', (byte)'E', (byte)'F',
};

static Vector<byte> CreateWithROSpan() => new Vector<byte>(data);

dump: https://gist.github.com/EgorBo/564520f9abfa8bc6da5103dc3a6606c9

Thanks for the quick fix! While at it, feel free to update HexConverter.cs#L99-L107 to use ""u8 if you want. 😉

leaving it for you 🙂

@EgorBo
Copy link
Member Author

EgorBo commented Jul 9, 2022

Ah, probably the T[] array overloads. ReadOnlySpan<T> going to be difficult to handle - it emits quite a verbose IR in JIT 🙁

@am11
Copy link
Member

am11 commented Jul 10, 2022

for AOT in non --Ot mode, which is default

The optimization is certainly turned on for Release configuration:

<Optimize Condition="'$(Configuration)' == 'Release' and '$(Optimize)' == ''">true</Optimize>
(-O is passed to ilc, which can be seen in VectorCreation/obj/Release/net7.0/linux-x64/native/VectorCreation.ilc.rsp from repro I posted)

@EgorBo
Copy link
Member Author

EgorBo commented Jul 10, 2022

for AOT in non --Ot mode, which is default

The optimization is certainly turned on for Release configuration:

<Optimize Condition="'$(Configuration)' == 'Release' and '$(Optimize)' == ''">true</Optimize>

(-O is passed to ilc, which can be seen in VectorCreation/obj/Release/net7.0/linux-x64/native/VectorCreation.ilc.rsp from repro I posted)

Yes, but there is a difference between -O and --Ot. -O is a sort of default optimization mode (aka blended) while --Ot is "prefer speed". --Ot uses a more aggressive inliner (hence, bigger binary size).

Same applies to crossgen2

@am11
Copy link
Member

am11 commented Jul 10, 2022

Ah nice, I missed that. With -p:IlcOptimizationPreference=Speed (that maps to --Ot), it gives:

Dump of assembler code for function VectorCreation_D__CreateWithROSpan:
   0x0000000000148410 <+0>:	push   rbp
   0x0000000000148411 <+1>:	mov    rbp,rsp
   0x0000000000148414 <+4>:	lea    rax,[rip+0x1e8f15]        # 0x331330
   0x000000000014841b <+11>:	movups xmm0,XMMWORD PTR [rax]
   0x000000000014841e <+14>:	movups XMMWORD PTR [rsi],xmm0
   0x0000000000148421 <+17>:	mov    rax,rsi
   0x0000000000148424 <+20>:	pop    rbp
   0x0000000000148425 <+21>:	ret    
End of assembler dump.

but the byte, byte, byte ... x 16 overload still has a better codegen. (I initially thought that we have params byte[] overload but I was wrong; we have specialized one for 16 bytes args).

@stephentoub stephentoub merged commit 25216ae into dotnet:main Jul 10, 2022
@EgorBo
Copy link
Member Author

EgorBo commented Jul 10, 2022

but the byte, byte, byte ... x 16 overload still has a better codegen.

The difference is that ROS<> is a static RVA -> it can be a single storage for vectors in different methods while the byte-byte-byte.. overload will emit the data per-method in the .data section - it's easier to access from both jit and aot - just a single load instruction with RIP-addressing. So, probably, your suggested change will always be slightly slower for AOT/CG2

@EgorBo EgorBo deleted the add-aggr-inli branch July 10, 2022 01:11
@ghost ghost locked as resolved and limited conversation to collaborators Aug 9, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Vector128.Create() codegen: ReadOnlySpan<byte> vs. byte[] overloads
4 participants