Improve "vec == Vector128<>.Zero" #75999

EgorBo · 2022-09-22T02:33:17Z

Follow up to #75864 to address @TamarChristinaArm's suggestion in #75849 (comment)
Btw, previous improvements seem to show nice benefits #75864 (comment)

static bool IsZero1(Vector128<int> v) => v == Vector128<int>.Zero;
static bool IsZero2(Vector64<int> v) => v == Vector64<int>.Zero;

Codegen diff:

; Method Tests:IsZero1
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp

-           umaxv   s16, v0.4s
-           umov    w0, v16.s[0]
-           cmp     w0, #0
+           umaxp   v16.4s, v0.4s, v0.4s
+           umov    x0, v16.d[0]
+           cmp     x0, #0
            cset    x0, eq

            ldp     fp, lr, [sp], #0x10
            ret     lr

; Method Tests:IsZero2
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp

-           umaxv   h16, v0.4h
-           umov    w0, v16.s[0]
-           cmp     w0, #0
+           umov    x0, v0.d[0]
+           cmp     x0, #0
            cset    x0, eq

            ldp     fp, lr, [sp], #0x10
            ret     lr
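
For readers who want to check the equivalence in the diff above, here is a minimal scalar sketch of the lane arithmetic. It is a model, not the JIT's code: the type alias and function names (`Lanes4`, `umaxp_self`, `is_zero_max_across`, `is_zero_max_pairwise`, `is_zero_v64`) are hypothetical and exist only for this illustration. It assumes the documented behaviour of `umaxp` with both sources set to the same register: the low two result lanes hold max(v0, v1) and max(v2, v3), so the low 64 bits are zero exactly when all four input lanes are zero.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical scalar model of a 128-bit register viewed as four 32-bit lanes.
using Lanes4 = std::array<std::uint32_t, 4>;
// Hypothetical scalar model of a 64-bit register viewed as two 32-bit lanes.
using Lanes2 = std::array<std::uint32_t, 2>;

// Old sequence: umaxv reduces all four lanes to a single 32-bit maximum,
// which is then compared against zero.
bool is_zero_max_across(const Lanes4& v) {
    std::uint32_t m = std::max(std::max(v[0], v[1]), std::max(v[2], v[3]));
    return m == 0;
}

// Model of "umaxp v16.4s, v0.4s, v0.4s": pairwise max over the concatenation
// of both sources. With both sources equal to v, the result is
// {max(v0,v1), max(v2,v3), max(v0,v1), max(v2,v3)}.
Lanes4 umaxp_self(const Lanes4& v) {
    std::uint32_t lo = std::max(v[0], v[1]);
    std::uint32_t hi = std::max(v[2], v[3]);
    return {lo, hi, lo, hi};
}

// New Vector128 sequence: pairwise max, then read lanes 0 and 1 as one
// 64-bit value (umov x0, v16.d[0]) and compare that against zero.
bool is_zero_max_pairwise(const Lanes4& v) {
    Lanes4 p = umaxp_self(v);
    std::uint64_t low64 = (static_cast<std::uint64_t>(p[1]) << 32) | p[0];
    return low64 == 0;
}

// New Vector64 sequence: no reduction at all, the 64 bits are read directly.
bool is_zero_v64(const Lanes2& v) {
    std::uint64_t all64 = (static_cast<std::uint64_t>(v[1]) << 32) | v[0];
    return all64 == 0;
}
```

Both Vector128 predicates agree for every input, because an unsigned maximum is zero only when each operand is zero; the pairwise form just stops one reduction step earlier and lets the 64-bit compare do the rest.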

Should be a nice win for Vector64. For Vector128 I wasn't able to see noticeable improvements on my Apple M1, but we might see improvements on other hardware (hopefully on Ampere Altra?).

However, even on M1 it seems to be better:

UMAXV                                     3          0.25     1        -     -     1     u11-14
UMAXP                                     2          0.25     1        -     -     1     u11-14

(1st column is Latency, according to https://dougallj.github.io/applecpu/firestorm-simd.html)

ghost · 2022-09-22T02:33:31Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	EgorBo
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

EgorBo · 2022-09-22T02:38:45Z

Oh, from what I see Latency on Ampere for umaxv is 6 for byte while umaxp for int is 2 so should be visible

TamarChristinaArm · 2022-09-22T02:42:51Z

> Oh, from what I see Latency on Ampere for umaxv is 6 for byte while umaxp for int is 2 so should be visible

I believe the 6 cycles applies only to bytes. For the zero special case I don't expect the same dramatic difference, since it's reducing from .s, which is fairly cheap. I only expect a single-cycle difference in this case. But it has better throughput, so I at least expect more consistent performance.

EgorBo · 2022-09-22T14:07:25Z

@dotnet/jit-contrib @kunalspathak PTAL, previously the logic was enabled for Vector3 (SIMD12) too and that was wrong, I'll try to come up with a bug repro and backport a quick fix to net7.0

EgorBo · 2022-09-22T14:12:08Z

> @dotnet/jit-contrib @kunalspathak PTAL, previously the logic was enabled for Vector3 (SIMD12) too and that was wrong, I'll try to come up with a bug repro and backport a quick fix to net7.0

Ah, Vector3 is only used for floats, so it won't pass the !varIsFloatingPoint check and we're fine, but let's keep the check here in case we introduce Vector3 for other primitives.
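
One likely reason a floating-point guard matters for this kind of lowering (an assumption on my part, the thread does not spell it out): the new sequence tests whether the raw bits are zero, while floating-point equality treats -0.0 as equal to 0.0 even though its sign bit is set. The sketch below shows the mismatch on a single lane. The helper names (`bits_of`, `float_equals_zero`, `bits_are_zero`) are hypothetical, and IEEE 754 single precision is assumed.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float's storage as a 32-bit integer, i.e. what an integer
// lane comparison would see. Assumes IEEE 754 single precision.
std::uint32_t bits_of(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Floating-point semantics: -0.0f compares equal to 0.0f.
bool float_equals_zero(float f) {
    return f == 0.0f;
}

// Bitwise semantics: only an all-zero bit pattern counts as zero.
bool bits_are_zero(float f) {
    return bits_of(f) == 0;
}
```

For -0.0f the two predicates disagree, which is why a bitwise "is everything zero" test is only a safe replacement for `== Zero` on integer element types.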

kunalspathak · 2022-09-22T18:57:56Z

src/coreclr/jit/lowerarmarch.cpp

-        BlockRange().InsertBefore(node, cmp);
-        LowerNode(cmp);
+        GenTree* cmp = op;
+        if (simdSize != 8) // we don't need compression for Vector64


Can you update the comment a few lines above to say MaxPairWise instead of MaxAcross?

EgorBo · 2022-09-26T00:37:55Z

Seems to be also beneficial:

EgorBo added 2 commits September 22, 2022 04:15

Improve Vector comparison against zero

c162e12

Fix build

5527c02

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 22, 2022

ghost assigned EgorBo Sep 22, 2022

Clean up

4dff1da

kunalspathak reviewed Sep 22, 2022


Update lowerarmarch.cpp

08047de

kunalspathak approved these changes Sep 22, 2022


EgorBo merged commit 011b949 into dotnet:main Sep 22, 2022

EgorBo deleted the arm64-faster-vec-zero branch September 22, 2022 21:36

EgorBo mentioned this pull request Sep 28, 2022

Add intrinsic for IndexOfAny on Arm64 #74010

Closed

kunalspathak mentioned this pull request Sep 29, 2022

[Perf] Windows/arm64: 10 Improvements on 9/22/2022 11:37:30 PM dotnet/perf-autofiling-issues#8767

Closed

ghost locked as resolved and limited conversation to collaborators Oct 26, 2022
