-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inefficient Vector128.Equals* comparisons on Arm64 #75849
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsDescriptionSomewhat related to #69325 in which it's shown that the mono implementation is slower than the CoreCLR one. using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.CompilerServices;
using System.Numerics;
namespace HelloWorld
{
internal class Program
{
[MethodImpl(MethodImplOptions.NoInlining)]
private static bool test(Vector128<int> a, Vector128<int> b)
{
return Vector128.EqualsAll(a,b);
}
private static void Main(string[] args)
{
Vector128<int> A = Vector128.Create((int)3.1);
Vector128<int> B = Vector128.Create((int)5.7);
var result = test(A, B);
Console.WriteLine(result);
}
}
} Generates
However you don't need a full reduction here, in particular reductions on bytes tend to be the most expensive ones. Instead it should do a partial reduction using
Basically you compress the range to 64-bytes and check that when getting the minimum you find no 0 values. Configuration
|
@EgorBo in terms of performance no, but you can't use the |
Ah, I see, thanks! |
@TamarChristinaArm btw, we have a special case for comparison against zero vector: private static bool iszero(Vector128<int> a)
{
return a == Vector128<int>.Zero;
} Codegen: ; Method Program:iszero(System.Runtime.Intrinsics.Vector128`1[int]):bool
G_M6372_IG01: ;; offset=0000H
A9BF7BFD stp fp, lr, [sp, #-0x10]!
910003FD mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M6372_IG02: ;; offset=0008H
6EB0A810 umaxv s16, v0.4s
0E043E00 umov w0, v16.s[0]
7100001F cmp w0, #0
9A9F17E0 cset x0, eq
;; size=16 bbWeight=1 PerfScore 5.00
G_M6372_IG03: ;; offset=0018H
A8C17BFD ldp fp, lr, [sp], #0x10
D65F03C0 ret lr
;; size=8 bbWeight=1 PerfScore 2.00 does it look good? Afair there was some other instruction instead |
@EgorBo ah that's a nice special case. so G_M6372_IG02: ;; offset=0008H
6EB0A810 umaxp v0.4s, v0.4s, v0.4s
0E043E00 umov x0, v0.d[0] That's about a cycle faster on most Arm cores but has much better throughput. |
Oh and for the 2EB18E10 cmeq v16.2s, v16.2s, v17.2s
0E013E00 umov w0, v16.d[0]
7100001F tst x0, 0xfffffffff
9A9F07E0 cset x0, eq I titled the issue |
Ah sorry, my example had an |
@TamarChristinaArm just to double check,
I am just failing to understand why it's not the same as:
in both cases we take a 64bit register and check whether its lower 32bit part contains any set bits, am I right? |
E.g. is this correct: G_M55131_IG01: ;; offset=0000H
A9BF7BFD stp fp, lr, [sp, #-0x10]!
910003FD mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M55131_IG02: ;; offset=0008H
6EA18C10 cmeq v16.4s, v0.4s, v1.4s
6EB0AE10 uminp v16.4s, v16.4s, v16.4s
0E043E00 umov w0, v16.s[0]
7100001F cmp w0, #0
9A9F17E0 cset x0, eq
;; size=20 bbWeight=1 PerfScore 4.00
G_M55131_IG03: ;; offset=001CH
A8C17BFD ldp fp, lr, [sp], #0x10
D65F03C0 ret lr |
Arg... sorry that mask should have been we check that the lower 64-bit containts all bits sets. It's important to remember that the uint64_t represents multiple vector lanes, not a single value. so say your vector is
and you're checking that all elements are
compressing it gives you
and transferring the lower 64-bits gives
Now if you check if you check Which is why you need the tst againt all the 64-bits, because the only case that should be true is when all bits are set. Sorry for the mixed up bitmask, hope this clears it up. |
@TamarChristinaArm thanks for clarification! so we get a 64bit value from SIMD and we need to make sure all bits are set, so essentially it's cmn x0, #1
cset w0, eq according to llvm.
I assume it misses one more |
Yup, that's correct. lol, it's way too easy to miss an |
Heh, that's why in C# we have |
hehe, it's a nice feature! it looks like indeed
but it seems that indeed all |
btw,
|
Closed via #75864 Will check EqualsAny if it brings benefits 🙂 Thanks for the suggestion! |
Oh since it's so beneficial let me file a PR for EqualsAny as well |
@TamarChristinaArm regarding the special case for static bool IsZero(Vector128<int> v0) => v0 == Vector128<int>.Zero; 6EA0A410 umaxp v16.4s, v0.4s, v0.4s
4E083E00 umov x0, v16.d[0]
B100041F cmn x0, #1
9A9F17E0 cset x0, eq |
Almost, in this case you want a comparison against 0. so |
Description
Somewhat related to #69325 in which it's shown that the mono implementation is slower than the CoreCLR one.
However the example also shows that the CoreCLR version is also inefficient:
Generates
However you don't need a full reduction here, in particular reductions on bytes tend to be the most expensive ones.
Instead it should do a partial reduction using
uminp
:Basically you compress the range to 64-bits and check that when getting the minimum you find no 0 values.
This has better throughput and latency.
Configuration
main
build from01f701c9ff2b5008528076688b48f602d155427b
The text was updated successfully, but these errors were encountered: