
Inefficient Vector128.Equals* comparisons on Arm64 #75849

Closed
TamarChristinaArm opened this issue Sep 19, 2022 · 20 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@TamarChristinaArm
Contributor

TamarChristinaArm commented Sep 19, 2022

Description

Somewhat related to #69325, in which it's shown that the Mono implementation is slower than the CoreCLR one.
However, the example also shows that the CoreCLR version is inefficient:

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.CompilerServices;
using System.Numerics;

namespace HelloWorld
{
    internal class Program
    {
        [MethodImpl(MethodImplOptions.NoInlining)]
        private static bool test(Vector128<int> a, Vector128<int> b)
        {
            return Vector128.EqualsAll(a,b);
        }

        private static void Main(string[] args)
        {
            Vector128<int> A = Vector128.Create((int)3.1);
            Vector128<int> B = Vector128.Create((int)5.7);

            var result = test(A, B);
            Console.WriteLine(result);
        }
    }
}

Generates

G_M12821_IG02:              ;; offset=0010H
        3DC00BB0          ldr     q16, [fp, #0x20]
        3DC007B1          ldr     q17, [fp, #0x10]
        6EB18E10          cmeq    v16.4s, v16.4s, v17.4s
        6E31AA10          uminv   b16, v16.16b
        0E013E00          umov    w0, v16.b[0]
        7100001F          cmp     w0, #0
        9A9F07E0          cset    x0, ne

However, you don't need a full reduction here; in particular, reductions on bytes tend to be the most expensive ones.

Instead it should do a partial reduction using uminp:

G_M12821_IG02:              ;; offset=0010H
        3DC00BB0          ldr     q16, [fp, #0x20]
        3DC007B1          ldr     q17, [fp, #0x10]
        6EB18E10          cmeq    v16.4s, v16.4s, v17.4s
        6E31AA10          uminp   v16.4s, v16.4s, v16.4s
        0E013E00          umov    x0, v16.d[0]
        7100001F          tst    x0, 0xfffffffff
        9A9F07E0          cset    x0, eq

Basically you compress the result to 64 bits and check that taking the minimum finds no 0 values.
This has better throughput and latency.
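The effect of the proposed sequence can be modeled in portable scalar code (an illustrative sketch, not the JIT's implementation; the lane-to-bit packing shown assumes little-endian lane order):

```c
#include <stdint.h>
#include <stdbool.h>

/* Scalar model of the suggested Arm64 lowering for Vector128.EqualsAll<int>:
   cmeq produces an all-ones/all-zeros mask per lane, uminp compresses the
   four 32-bit lanes to two (i.e. to 64 meaningful bits), and the final
   scalar test requires every one of those 64 bits to be set. */
static bool equals_all_4s(const int32_t a[4], const int32_t b[4])
{
    uint32_t eq[4];
    for (int i = 0; i < 4; i++)                      /* cmeq v16.4s, vA, vB */
        eq[i] = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u;
    uint32_t p0 = eq[0] < eq[1] ? eq[0] : eq[1];     /* uminp: pairwise min */
    uint32_t p1 = eq[2] < eq[3] ? eq[2] : eq[3];
    uint64_t low = ((uint64_t)p1 << 32) | p0;        /* umov x0, v16.d[0]   */
    return low == UINT64_MAX;                        /* all 64 bits set?    */
}
```

All-equal inputs yield low == 0xFFFF_FFFF_FFFF_FFFF; a single unequal lane clears 32 of those bits, so the all-bits-set test distinguishes the two cases where a plain zero/non-zero check cannot.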

Configuration

  • Which version of .NET is the code running on? main build from 01f701c9ff2b5008528076688b48f602d155427b
  • What OS version, and what distro if applicable? Ubuntu
  • What is the architecture (x64, x86, ARM, ARM64)? ARM64
@TamarChristinaArm TamarChristinaArm added the tenet-performance Performance related issue label Sep 19, 2022

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 19, 2022
@EgorBo EgorBo added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 19, 2022
@EgorBo EgorBo added this to the 8.0.0 milestone Sep 19, 2022
@ghost
ghost commented Sep 19, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.
@EgorBo EgorBo self-assigned this Sep 19, 2022
@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Sep 19, 2022
@TamarChristinaArm
Contributor Author

@TamarChristinaArm is

        7100001F          tst     x0, 0xfffffffff
        9A9F07E0          cset    x0, eq

better than

        7100001F          cmp     w0, #0
        9A9F07E0          cset    x0, ne

?

@EgorBo in terms of performance, no, but you can't use the cmp in this case: if your vector compares end up giving you
0xFFFF_0000, i.e. one element compared unequal and another equal, comparing the register with 0 would incorrectly report that all the elements were equal. So you need to check all the bits, and the immediate range of cmp won't allow that.

@EgorBo
Member

EgorBo commented Sep 19, 2022

Ah, I see, thanks!

@EgorBo
Member

EgorBo commented Sep 19, 2022

@TamarChristinaArm btw, we have a special case for comparison against zero vector:

private static bool iszero(Vector128<int> a)
{
    return a == Vector128<int>.Zero;
}

Codegen:

; Method Program:iszero(System.Runtime.Intrinsics.Vector128`1[int]):bool
G_M6372_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp, #-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M6372_IG02:              ;; offset=0008H
        6EB0A810          umaxv   s16, v0.4s
        0E043E00          umov    w0, v16.s[0]
        7100001F          cmp     w0, #0
        9A9F17E0          cset    x0, eq
						;; size=16 bbWeight=1    PerfScore 5.00
G_M6372_IG03:              ;; offset=0018H
        A8C17BFD          ldp     fp, lr, [sp], #0x10
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00

Does it look good? AFAIR there was some other instruction instead of umaxv to make it faster for some old hardware (@echesakov might remember 🙂)

@TamarChristinaArm
Contributor Author

TamarChristinaArm commented Sep 19, 2022

@EgorBo ah that's a nice special case.
It's not as bad as the byte version, but the same trick can speed it up on all hardware,

so

G_M6372_IG02:              ;; offset=0008H
        6EB0A810          umaxp   v0.4s, v0.4s, v0.4s
        0E043E00          umov    x0, v0.d[0]

That's about a cycle faster on most Arm cores and has much better throughput.

@TamarChristinaArm
Contributor Author

Oh, and for the Vector64<T> version you don't need the compression, since only the bottom 64 bits contain values anyway.
so

        2EB18E10          cmeq    v16.2s, v16.2s, v17.2s
        0E013E00          umov    x0, v16.d[0]
        7100001F          tst     x0, 0xfffffffff
        9A9F07E0          cset    x0, eq

I titled the issue Vector128, but it applies to both.

@TamarChristinaArm
Contributor Author

@TamarChristinaArm regarding tst x0, 0xfffffffff I am not sure I understand this function, is 0xfffffffff constant imm-able? e.g. https://godbolt.org/z/z3e9fKhW1

Ah sorry, my example had an f too many: https://godbolt.org/z/3ThhPTTfT
tst accepts bitmasks and an all 1s mask is easily representable.

@EgorBo
Member

EgorBo commented Sep 19, 2022

@TamarChristinaArm just to double check,
so in your snippet you have

        tst     x0, 0xffffffff
        cset    x0, eq

I am just failing to understand why it's not the same as:

        cmp     w0, #0
        cset    w0, eq

in both cases we take a 64bit register and check whether its lower 32bit part contains any set bits, am I right?
essentially the same as here: https://godbolt.org/z/c9E179rv5

@EgorBo
Member

EgorBo commented Sep 19, 2022

E.g. is this correct:

G_M55131_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp, #-0x10]!
        910003FD          mov     fp, sp
                                                ;; size=8 bbWeight=1    PerfScore 1.50
G_M55131_IG02:              ;; offset=0008H
        6EA18C10          cmeq    v16.4s, v0.4s, v1.4s
        6EB0AE10          uminp   v16.4s, v16.4s, v16.4s
        0E043E00          umov    w0, v16.s[0]
        7100001F          cmp     w0, #0
        9A9F17E0          cset    x0, eq
                                                ;; size=20 bbWeight=1    PerfScore 4.00
G_M55131_IG03:              ;; offset=001CH
        A8C17BFD          ldp     fp, lr, [sp], #0x10
        D65F03C0          ret     lr

@TamarChristinaArm
Contributor Author

@TamarChristinaArm just to double check, so in your snippet you have

        tst     x0, 0xffffffff
        cset    x0, eq

I am just failing to understand why it's not the same as:

        cmp     w0, #0
        cset    w0, eq

in both cases we take a 64bit register and check whether its lower 32bit part contains any set bits, am I right? essentially the same as here: https://godbolt.org/z/c9E179rv5

Arg... sorry that mask should have been 0xfffffffffffffff We check the entire 64-bit value. See https://godbolt.org/z/1KEfbaaxa
Let me add some _ between these to make it more readable..

We check that the lower 64 bits contain all bits set. It's important to remember that the uint64_t represents multiple vector lanes, not a single value.

so say your vector is

a = { 1, 1, 0, 1 }

and you're checking that all elements are >= 1.
For this example you get the results

a = { 0xffff_ffff, 0xffff_ffff, 0, 0xffff_ffff }

compressing it gives you

a = { 0xffff_ffff, 0, 0xffff_ffff, 0 }

and transferring the lower 64-bits gives

a = 0xffff_ffff_0000_0000

Now if you check a == 0, you'll get false for when a = 0xffff_ffff_ffff_ffff, which should be true, since all elements were >= 1.

if you check a != 0 you'll get true for 0xffff_ffff_0000_0000 but that's wrong since one element wasn't >= 1.

Which is why you need the tst against all 64 bits, because the only case that should be true is when all bits are set.

Sorry for the mixed up bitmask, hope this clears it up.
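The worked example can also be re-run in scalar code (illustrative only; with the little-endian packing used here the constant comes out as 0x00000000_FFFFFFFF rather than 0xFFFF_FFFF_0000_0000, but the decisive property, being neither zero nor all-ones, is identical):

```c
#include <stdint.h>

/* Lane mask for "element >= 1", compressed pairwise as uminp would do,
   then packed into the low 64 bits. */
static uint64_t compress_ge1(const int32_t v[4])
{
    uint32_t m[4];
    for (int i = 0; i < 4; i++)
        m[i] = (v[i] >= 1) ? 0xFFFFFFFFu : 0u;   /* per-lane comparison mask */
    uint32_t p0 = m[0] < m[1] ? m[0] : m[1];     /* uminp: pairwise minimum  */
    uint32_t p1 = m[2] < m[3] ? m[2] : m[3];
    return ((uint64_t)p1 << 32) | p0;            /* low 64 bits of the vector */
}
```

For {1, 1, 0, 1} the result is non-zero but not all-ones, so a `!= 0` check wrongly reports all-true and a `== 0` check wrongly reports none-true; only comparing against all-ones answers AllTrue correctly.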

@EgorBo
Member

EgorBo commented Sep 19, 2022

@TamarChristinaArm thanks for the clarification! So we get a 64-bit value from SIMD and we need to make sure all bits are set, so essentially it's u64 == 0xFFFF_FFFF_FFFF_FFFF if I understand it correctly, which is

cmn     x0, #1
cset    w0, eq

according to llvm.

Arg... sorry that mask should have been 0xfffffffffffffff

I assume it misses one more f to check the whole 64bit range

@TamarChristinaArm
Contributor Author

Yup, that's correct. lol, it's way too easy to miss an f in that mess..

@EgorBo
Member

EgorBo commented Sep 19, 2022

Yup, that's correct. lol, it's way too easy to miss an f in that mess..

Heh, that's why in C# we have _ for integer literals 🙂

@TamarChristinaArm
Contributor Author

Yup, that's correct. lol, it's way too easy to miss an f in that mess..

Heh, that's why in C# we have _ for integer literals 🙂

hehe, it's a nice feature! It looks like indeed -1 isn't a tst immediate; we usually use tst with 64-bit masks such as

tst    x0, 0xf0f0_f0f0_f0f0_f0f0

but it seems that indeed all 1s is punted to cmn.
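That matches the AArch64 encoding rule: a logical-immediate bitmask is a rotated contiguous run of ones replicated across the register, with all-zeros and all-ones explicitly unencodable. A simplified checker (a sketch of the rule, not the reference encoder) looks like:

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true if v is expressible as a 64-bit AArch64 logical immediate:
   an element of size 2/4/8/16/32/64 bits that is a rotated contiguous run
   of ones, replicated to fill the register. 0 and ~0 are never encodable. */
static bool is_logical_imm(uint64_t v)
{
    if (v == 0 || v == UINT64_MAX)
        return false;
    for (unsigned e = 2; e <= 64; e <<= 1) {
        uint64_t mask = (e == 64) ? UINT64_MAX : ((1ULL << e) - 1);
        uint64_t elem = v & mask;
        uint64_t rep = 0;
        for (unsigned i = 0; i < 64; i += e)      /* replicate the element */
            rep |= elem << i;
        if (rep != v)
            continue;
        for (unsigned r = 0; r < e; r++) {        /* try every rotation */
            uint64_t rot = r ? (((elem >> r) | (elem << (e - r))) & mask)
                             : elem;
            /* rot is a contiguous run of ones starting at bit 0
               iff rot + 1 is a power of two */
            if (rot != 0 && ((rot + 1) & rot) == 0)
                return true;
        }
    }
    return false;
}
```

Under this rule 0xf0f0_f0f0_f0f0_f0f0 passes (an 8-bit element that is a rotated run of four ones), while -1 fails and gets punted to cmn, as observed above.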

@TamarChristinaArm
Contributor Author

btw,

cmp x0, #0 instead of cmn x0, #1 would be the semantics for EqualsAny.
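In scalar terms, once the comparison mask has been compressed to 64 bits, the two reductions differ only in the final check (a sketch; the compressed value is assumed to come from a sequence like the ones earlier in the thread):

```c
#include <stdint.h>
#include <stdbool.h>

/* compressed: low 64 bits of the pairwise-compressed comparison mask. */
static bool equals_all(uint64_t compressed)
{
    return compressed == UINT64_MAX;   /* cmn x0, #1 ; cset x0, eq */
}

static bool equals_any(uint64_t compressed)
{
    return compressed != 0;            /* cmp x0, #0 ; cset x0, ne */
}
```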

@danmoseley danmoseley added the untriaged New issue has not been triaged by the area owner label Sep 20, 2022
@EgorBo
Member

EgorBo commented Sep 20, 2022

Closed via #75864
Let's see what the perf benchmarks think, as it shows up on hot paths.

Will check EqualsAny if it brings benefits 🙂 Thanks for the suggestion!

@EgorBo EgorBo closed this as completed Sep 20, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Sep 20, 2022
@EgorBo
Member

EgorBo commented Sep 21, 2022

Oh since it's so beneficial let me file a PR for EqualsAny as well

@EgorBo
Member

EgorBo commented Sep 22, 2022

@TamarChristinaArm regarding the special case for is zero, is this correct?:

static bool IsZero(Vector128<int> v0) => v0 == Vector128<int>.Zero;
        6EA0A410          umaxp   v16.4s, v0.4s, v0.4s
        4E083E00          umov    x0, v16.d[0]
        B100041F          cmn     x0, #1
        9A9F17E0          cset    x0, eq

@TamarChristinaArm
Contributor Author

@TamarChristinaArm regarding the special case for is zero, is this correct?:

static bool IsZero(Vector128<int> v0) => v0 == Vector128<int>.Zero;
        6EA0A410          umaxp   v16.4s, v0.4s, v0.4s
        4E083E00          umov    x0, v16.d[0]
        B100041F          cmn     x0, #1
        9A9F17E0          cset    x0, eq

Almost; in this case you want a comparison against 0, so cmp x0, #0, as you want to check that no lane of v0 contained any set bits.
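A scalar model of the corrected IsZero sequence (illustrative; note it compresses the raw lanes rather than a comparison mask, which is why a plain compare against 0 is the right final step):

```c
#include <stdint.h>
#include <stdbool.h>

/* umaxp compresses the four raw 32-bit lanes to 64 bits by pairwise
   unsigned max; the vector is all-zero iff that 64-bit value is 0. */
static bool is_zero_4s(const int32_t v[4])
{
    uint32_t a = (uint32_t)v[0], b = (uint32_t)v[1];
    uint32_t c = (uint32_t)v[2], d = (uint32_t)v[3];
    uint32_t p0 = a > b ? a : b;                 /* umaxp: pairwise max */
    uint32_t p1 = c > d ? c : d;
    return (((uint64_t)p1 << 32) | p0) == 0;     /* cmp x0, #0          */
}
```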

@ghost ghost locked as resolved and limited conversation to collaborators Oct 22, 2022