
Inefficient Vector128.Equals* comparisons on Arm64 #75849

Closed
TamarChristinaArm opened this issue Sep 19, 2022 · 20 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@TamarChristinaArm
Contributor

TamarChristinaArm commented Sep 19, 2022

Description

Somewhat related to #69325, in which it's shown that the Mono implementation is slower than the CoreCLR one.
However, the example also shows that the CoreCLR version is inefficient:

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.CompilerServices;
using System.Numerics;

namespace HelloWorld
{
    internal class Program
    {
        [MethodImpl(MethodImplOptions.NoInlining)]
        private static bool test(Vector128<int> a, Vector128<int> b)
        {
            return Vector128.EqualsAll(a,b);
        }

        private static void Main(string[] args)
        {
            Vector128<int> A = Vector128.Create((int)3.1);
            Vector128<int> B = Vector128.Create((int)5.7);

            var result = test(A, B);
            Console.WriteLine(result);
        }
    }
}

Generates

G_M12821_IG02:              ;; offset=0010H
        3DC00BB0          ldr     q16, [fp, #0x20]
        3DC007B1          ldr     q17, [fp, #0x10]
        6EB18E10          cmeq    v16.4s, v16.4s, v17.4s
        6E31AA10          uminv   b16, v16.16b
        0E013E00          umov    w0, v16.b[0]
        7100001F          cmp     w0, #0
        9A9F07E0          cset    x0, ne

However, you don't need a full reduction here; in particular, reductions on bytes tend to be the most expensive ones.

Instead it should do a partial reduction using uminp:

G_M12821_IG02:              ;; offset=0010H
        3DC00BB0          ldr     q16, [fp, #0x20]
        3DC007B1          ldr     q17, [fp, #0x10]
        6EB18E10          cmeq    v16.4s, v16.4s, v17.4s
        6E31AA10          uminp   v16.4s, v16.4s, v16.4s
        0E013E00          umov    x0, v16.d[0]
        7100001F          tst    x0, 0xfffffffff
        9A9F07E0          cset    x0, eq

Basically you compress the result to 64 bits and check that taking the minimum finds no 0 values.
This has better throughput and latency.
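The effect of the proposed sequence can be modeled in portable scalar code (an illustrative sketch, not the JIT's implementation; the lane-to-bit packing shown assumes little-endian lane order):

```c
#include <stdint.h>
#include <stdbool.h>

/* Scalar model of the suggested Arm64 lowering for Vector128.EqualsAll<int>:
   cmeq produces an all-ones/all-zeros mask per lane, uminp compresses the
   four 32-bit lanes to two (i.e. to 64 meaningful bits), and the final
   scalar test requires every one of those 64 bits to be set. */
static bool equals_all_4s(const int32_t a[4], const int32_t b[4])
{
    uint32_t eq[4];
    for (int i = 0; i < 4; i++)                      /* cmeq v16.4s, vA, vB */
        eq[i] = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u;
    uint32_t p0 = eq[0] < eq[1] ? eq[0] : eq[1];     /* uminp: pairwise min */
    uint32_t p1 = eq[2] < eq[3] ? eq[2] : eq[3];
    uint64_t low = ((uint64_t)p1 << 32) | p0;        /* umov x0, v16.d[0]   */
    return low == UINT64_MAX;                        /* all 64 bits set?    */
}
```

All-equal inputs yield low == 0xFFFF_FFFF_FFFF_FFFF; a single unequal lane clears 32 of those bits, so the all-bits-set test distinguishes the two cases where a plain zero/non-zero check cannot.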

Configuration

  • Which version of .NET is the code running on? main build from 01f701c9ff2b5008528076688b48f602d155427b
  • What OS version, and what distro if applicable? Ubuntu
  • What is the architecture (x64, x86, ARM, ARM64)? ARM64
@TamarChristinaArm TamarChristinaArm added the tenet-performance Performance related issue label Sep 19, 2022

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 19, 2022
@EgorBo EgorBo added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 19, 2022
@EgorBo EgorBo added this to the 8.0.0 milestone Sep 19, 2022
@ghost
ghost commented Sep 19, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.
@EgorBo EgorBo self-assigned this Sep 19, 2022
@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Sep 19, 2022
@TamarChristinaArm
Contributor Author

@TamarChristinaArm is

        7100001F          tst     x0, 0xfffffffff
        9A9F07E0          cset    x0, eq

better than

        7100001F          cmp     w0, #0
        9A9F07E0          cset    x0, ne

?

@EgorBo in terms of performance, no, but you can't use the cmp in this case: if your vector compares end up giving you
0xFFFF_0000, i.e. one element compared unequal and another equal, comparing the register with 0 would incorrectly report that all the elements were equal. So you need to check all the bits, and the immediate range of cmp won't allow that.

@EgorBo
Member

EgorBo commented Sep 19, 2022

Ah, I see, thanks!

@EgorBo
Member

EgorBo commented Sep 19, 2022

@TamarChristinaArm btw, we have a special case for comparison against zero vector:

private static bool iszero(Vector128<int> a)
{
    return a == Vector128<int>.Zero;
}

Codegen:

; Method Program:iszero(System.Runtime.Intrinsics.Vector128`1[int]):bool
G_M6372_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp, #-0x10]!
        910003FD          mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50
G_M6372_IG02:              ;; offset=0008H
        6EB0A810          umaxv   s16, v0.4s
        0E043E00          umov    w0, v16.s[0]
        7100001F          cmp     w0, #0
        9A9F17E0          cset    x0, eq
						;; size=16 bbWeight=1    PerfScore 5.00
G_M6372_IG03:              ;; offset=0018H
        A8C17BFD          ldp     fp, lr, [sp], #0x10
        D65F03C0          ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00

Does it look good? AFAIR there was some other instruction instead of umaxv to make it faster for some old hardware (@echesakov might remember 🙂)

@TamarChristinaArm
Contributor Author

TamarChristinaArm commented Sep 19, 2022

@EgorBo ah that's a nice special case.
It's not as bad as the byte version, but the same trick can speed it up on all hardware,

so

G_M6372_IG02:              ;; offset=0008H
        6EB0A810          umaxp   v0.4s, v0.4s, v0.4s
        0E043E00          umov    x0, v0.d[0]

That's about a cycle faster on most Arm cores and has much better throughput.

@TamarChristinaArm
Contributor Author

Oh, and for the Vector64<T> version you don't need the compression, since only the bottom 64 bits contain values anyway.
so

        2EB18E10          cmeq    v16.2s, v16.2s, v17.2s
        0E013E00          umov    x0, v16.d[0]
        7100001F          tst     x0, 0xfffffffff
        9A9F07E0          cset    x0, eq

I titled the issue Vector128, but it applies to both.

@TamarChristinaArm
Contributor Author

@TamarChristinaArm regarding tst x0, 0xfffffffff I am not sure I understand this function, is 0xfffffffff constant imm-able? e.g. https://godbolt.org/z/z3e9fKhW1

Ah sorry, my example had an f too many: https://godbolt.org/z/3ThhPTTfT
tst accepts bitmasks and an all 1s mask is easily representable.

@EgorBo
Member

EgorBo commented Sep 19, 2022

@TamarChristinaArm just to double check,
so in your snippet you have

        tst     x0, 0xffffffff
        cset    x0, eq

I am just failing to understand why it's not the same as:

        cmp     w0, #0
        cset    w0, eq

in both cases we take a 64bit register and check whether its lower 32bit part contains any set bits, am I right?
essentially the same as here: https://godbolt.org/z/c9E179rv5

@EgorBo
Member

EgorBo commented Sep 19, 2022

E.g. is this correct:

G_M55131_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp, #-0x10]!
        910003FD          mov     fp, sp
                                                ;; size=8 bbWeight=1    PerfScore 1.50
G_M55131_IG02:              ;; offset=0008H
        6EA18C10          cmeq    v16.4s, v0.4s, v1.4s
        6EB0AE10          uminp   v16.4s, v16.4s, v16.4s
        0E043E00          umov    w0, v16.s[0]
        7100001F          cmp     w0, #0
        9A9F17E0          cset    x0, eq
                                                ;; size=20 bbWeight=1    PerfScore 4.00
G_M55131_IG03:              ;; offset=001CH
        A8C17BFD          ldp     fp, lr, [sp], #0x10
        D65F03C0          ret     lr

@TamarChristinaArm
Contributor Author

@TamarChristinaArm just to double check, so in your snippet you have

        tst     x0, 0xffffffff
        cset    x0, eq

I am just failing to understand why it's not the same as:

        cmp     w0, #0
        cset    w0, eq

in both cases we take a 64bit register and check whether its lower 32bit part contains any set bits, am I right? essentially the same as here: https://godbolt.org/z/c9E179rv5

Arg... sorry that mask should have been 0xfffffffffffffff We check the entire 64-bit value. See https://godbolt.org/z/1KEfbaaxa
Let me add some _ between these to make it more readable..

We check that the lower 64 bits contain all bits set. It's important to remember that the uint64_t represents multiple vector lanes, not a single value.

so say your vector is

a = { 1, 1, 0, 1 }

and you're checking that all elements are >= 1.
For this example you get the results

a = { 0xffff_ffff, 0xffff_ffff, 0, 0xffff_ffff }

compressing it gives you

a = { 0xffff_ffff, 0, 0xffff_ffff, 0 }

and transferring the lower 64-bits gives

a = 0xffff_ffff_0000_0000

Now if you check a == 0, you'll get false for when a = 0xffff_ffff_ffff_ffff, which should be true, since all elements were >= 1.

if you check a != 0 you'll get true for 0xffff_ffff_0000_0000 but that's wrong since one element wasn't >= 1.

Which is why you need the tst against all 64 bits, because the only case that should be true is when all bits are set.

Sorry for the mixed up bitmask, hope this clears it up.
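The worked example can also be re-run in scalar code (illustrative only; with the little-endian packing used here the constant comes out as 0x00000000_FFFFFFFF rather than 0xFFFF_FFFF_0000_0000, but the decisive property, being neither zero nor all-ones, is identical):

```c
#include <stdint.h>

/* Lane mask for "element >= 1", compressed pairwise as uminp would do,
   then packed into the low 64 bits. */
static uint64_t compress_ge1(const int32_t v[4])
{
    uint32_t m[4];
    for (int i = 0; i < 4; i++)
        m[i] = (v[i] >= 1) ? 0xFFFFFFFFu : 0u;   /* per-lane comparison mask */
    uint32_t p0 = m[0] < m[1] ? m[0] : m[1];     /* uminp: pairwise minimum  */
    uint32_t p1 = m[2] < m[3] ? m[2] : m[3];
    return ((uint64_t)p1 << 32) | p0;            /* low 64 bits of the vector */
}
```

For {1, 1, 0, 1} the result is non-zero but not all-ones, so a `!= 0` check wrongly reports all-true and a `== 0` check wrongly reports none-true; only comparing against all-ones answers AllTrue correctly.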

@EgorBo
Member

EgorBo commented Sep 19, 2022

@TamarChristinaArm thanks for the clarification! So we get a 64-bit value from SIMD and we need to make sure all bits are set, so essentially it's u64 == 0xFFFF_FFFF_FFFF_FFFF if I understand it correctly, which is

cmn     x0, #1
cset    w0, eq

according to llvm.

Arg... sorry that mask should have been 0xfffffffffffffff

I assume it misses one more f to check the whole 64bit range

@TamarChristinaArm
Contributor Author

Yup, that's correct. lol, it's way too easy to miss an f in that mess..

@EgorBo
Member

EgorBo commented Sep 19, 2022

Yup, that's correct. lol, it's way too easy to miss an f in that mess..

Heh, that's why in C# we have _ for integer literals 🙂

@TamarChristinaArm
Contributor Author

Yup, that's correct. lol, it's way too easy to miss an f in that mess..

Heh, that's why in C# we have _ for integer literals 🙂

hehe, it's a nice feature! It looks like indeed -1 isn't a tst immediate; we usually use tst with 64-bit masks such as

tst    x0, 0xf0f0_f0f0_f0f0_f0f0

but it seems that indeed all 1s is punted to cmn.
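That matches the AArch64 encoding rule: a logical-immediate bitmask is a rotated contiguous run of ones replicated across the register, with all-zeros and all-ones explicitly unencodable. A simplified checker (a sketch of the rule, not the reference encoder) looks like:

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true if v is expressible as a 64-bit AArch64 logical immediate:
   an element of size 2/4/8/16/32/64 bits that is a rotated contiguous run
   of ones, replicated to fill the register. 0 and ~0 are never encodable. */
static bool is_logical_imm(uint64_t v)
{
    if (v == 0 || v == UINT64_MAX)
        return false;
    for (unsigned e = 2; e <= 64; e <<= 1) {
        uint64_t mask = (e == 64) ? UINT64_MAX : ((1ULL << e) - 1);
        uint64_t elem = v & mask;
        uint64_t rep = 0;
        for (unsigned i = 0; i < 64; i += e)      /* replicate the element */
            rep |= elem << i;
        if (rep != v)
            continue;
        for (unsigned r = 0; r < e; r++) {        /* try every rotation */
            uint64_t rot = r ? (((elem >> r) | (elem << (e - r))) & mask)
                             : elem;
            /* rot is a contiguous run of ones starting at bit 0
               iff rot + 1 is a power of two */
            if (rot != 0 && ((rot + 1) & rot) == 0)
                return true;
        }
    }
    return false;
}
```

Under this rule 0xf0f0_f0f0_f0f0_f0f0 passes (an 8-bit element that is a rotated run of four ones), while -1 fails and gets punted to cmn, as observed above.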

@TamarChristinaArm
Contributor Author

btw,

cmp x0, #0 instead of cmn x0, #1 would be the semantics for EqualsAny.
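In scalar terms, once the comparison mask has been compressed to 64 bits, the two reductions differ only in the final check (a sketch; the compressed value is assumed to come from a sequence like the ones earlier in the thread):

```c
#include <stdint.h>
#include <stdbool.h>

/* compressed: low 64 bits of the pairwise-compressed comparison mask. */
static bool equals_all(uint64_t compressed)
{
    return compressed == UINT64_MAX;   /* cmn x0, #1 ; cset x0, eq */
}

static bool equals_any(uint64_t compressed)
{
    return compressed != 0;            /* cmp x0, #0 ; cset x0, ne */
}
```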

@danmoseley danmoseley added the untriaged New issue has not been triaged by the area owner label Sep 20, 2022
@EgorBo
Member

EgorBo commented Sep 20, 2022

Closed via #75864
Let's see what the perf benchmarks think, as it shows up on hot paths.

Will check EqualsAny if it brings benefits 🙂 Thanks for the suggestion!

@EgorBo EgorBo closed this as completed Sep 20, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Sep 20, 2022
@EgorBo
Member

EgorBo commented Sep 21, 2022

Oh since it's so beneficial let me file a PR for EqualsAny as well

@EgorBo
Member

EgorBo commented Sep 22, 2022

@TamarChristinaArm regarding the special case for is zero, is this correct?:

static bool IsZero(Vector128<int> v0) => v0 == Vector128<int>.Zero;
        6EA0A410          umaxp   v16.4s, v0.4s, v0.4s
        4E083E00          umov    x0, v16.d[0]
        B100041F          cmn     x0, #1
        9A9F17E0          cset    x0, eq

@TamarChristinaArm
Contributor Author

@TamarChristinaArm regarding the special case for is zero, is this correct?:

static bool IsZero(Vector128<int> v0) => v0 == Vector128<int>.Zero;
        6EA0A410          umaxp   v16.4s, v0.4s, v0.4s
        4E083E00          umov    x0, v16.d[0]
        B100041F          cmn     x0, #1
        9A9F17E0          cset    x0, eq

Almost; in this case you want a comparison against 0, so cmp x0, #0, as you want to check that no lane of v0 contained any set bits.
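A scalar model of the corrected IsZero sequence (illustrative; note it compresses the raw lanes rather than a comparison mask, which is why a plain compare against 0 is the right final step):

```c
#include <stdint.h>
#include <stdbool.h>

/* umaxp compresses the four raw 32-bit lanes to 64 bits by pairwise
   unsigned max; the vector is all-zero iff that 64-bit value is 0. */
static bool is_zero_4s(const int32_t v[4])
{
    uint32_t a = (uint32_t)v[0], b = (uint32_t)v[1];
    uint32_t c = (uint32_t)v[2], d = (uint32_t)v[3];
    uint32_t p0 = a > b ? a : b;                 /* umaxp: pairwise max */
    uint32_t p1 = c > d ? c : d;
    return (((uint64_t)p1 << 32) | p0) == 0;     /* cmp x0, #0          */
}
```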

@ghost ghost locked as resolved and limited conversation to collaborators Oct 22, 2022