Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve "vec == Vector128<>.Zero" #75999

Merged
merged 4 commits into from
Sep 22, 2022
Merged

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Sep 22, 2022

Follow up to #75864 to address @TamarChristinaArm's suggestion in #75849 (comment)
Btw, previous improvements seem to show nice benefits #75864 (comment)

static bool IsZero1(Vector128<int> v) => v == Vector128<int>.Zero;
static bool IsZero2(Vector64<int> v) => v == Vector64<int>.Zero;

Codegen diff:

; Method Tests:IsZero1
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp

-           umaxv   s16, v0.4s
-           umov    w0, v16.s[0]
-           cmp     w0, #0
+           umaxp   v16.4s, v0.4s, v0.4s
+           umov    x0, v16.d[0]
+           cmp     x0, #0
            cset    x0, eq

            ldp     fp, lr, [sp], #0x10
            ret     lr

; Method Tests:IsZero2
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp

-           umaxv   h16, v0.4h
-           umov    w0, v16.s[0]
-           cmp     w0, #0
+           umov    x0, v0.d[0]
+           cmp     x0, #0
            cset    x0, eq

            ldp     fp, lr, [sp], #0x10
            ret     lr

Should be a nice win for Vector64. For Vector128 I wasn't able to see noticeable improvements on my Apple M1 but we might see improvements on other (hopefully, on Ampere Altra?)

However, even on M1 it seems to be better:

UMAXV                                     3          0.25     1        -     -     1     u11-14
UMAXP                                     2          0.25     1        -     -     1     u11-14

(1st column is Latency, according to https://dougallj.github.io/applecpu/firestorm-simd.html)

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 22, 2022
@ghost ghost assigned EgorBo Sep 22, 2022
@ghost
Copy link

ghost commented Sep 22, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Follow up to #75864 to address @TamarChristinaArm's suggestion in #75849 (comment)

static bool IsZero1(Vector128<int> v) => v == Vector128<int>.Zero;
static bool IsZero2(Vector64<int> v) => v == Vector64<int>.Zero;

Codegen diff:

; Method Tests:IsZero1
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp

-           umaxv   s16, v0.4s
-           umov    w0, v16.s[0]
-           cmp     w0, #0
+           umaxp   v16.4s, v0.4s, v0.4s
+           umov    x0, v16.d[0]
+           cmp     x0, #0
            cset    x0, eq

            ldp     fp, lr, [sp], #0x10
            ret     lr

; Method Tests:IsZero2
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp

-           umaxv   h16, v0.4h
-           umov    w0, v16.s[0]
-           cmp     w0, #0
+           umov    x0, v0.d[0]
+           cmp     x0, #0
            cset    x0, eq

            ldp     fp, lr, [sp], #0x10
            ret     lr

Should be a nice win for Vector64. For Vector128 I wasn't able to see noticeable improvements on my Apple M1 but we might see improvements on other (hopefully, on Ampere Altra?)

However, even on M1 it seems to be better:

UMAXV                                     3          0.25     1        -     -     1     u11-14
UMAXP                                     2          0.25     1        -     -     1     u11-14

(1st column is Latency, according to https://dougallj.github.io/applecpu/firestorm-simd.html)

Author: EgorBo
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@EgorBo
Copy link
Member Author

EgorBo commented Sep 22, 2022

Oh, from what I see Latency on Ampere for umaxv is 6 for byte while umaxp for int is 2 so should be visible

@TamarChristinaArm
Copy link
Contributor

Oh, from what I see Latency on Ampere for umaxv is 6 for byte while umaxp for int is 2 so should be visible

The 6 cycles is I believe only for bytes, the zero special case I don't expect the same dramatic difference as it's reducing from .s which are fairly cheap. I only expect a single cycle difference in this case. But it has better throughput so I at least expect more consistent performance.

@EgorBo
Copy link
Member Author

EgorBo commented Sep 22, 2022

@dotnet/jit-contrib @kunalspathak PTAL, previously the logic was enabled for Vector3 (SIMD12) too and that was wrong, I'll try to come up with a bug repro and backport a quick fix to net7.0

@EgorBo
Copy link
Member Author

EgorBo commented Sep 22, 2022

@dotnet/jit-contrib @kunalspathak PTAL, previously the logic was enabled for Vector3 (SIMD12) too and that was wrong, I'll try to come up with a bug repro and backport a quick fix to net7.0

Ah, Vector3 is only used for floats so it won't pass the !varIsFloatingPoint so we're fine, but let's keep the check here in case if we introduce Vector3 for other primitives.

BlockRange().InsertBefore(node, cmp);
LowerNode(cmp);
GenTree* cmp = op;
if (simdSize != 8) // we don't need compression for Vector64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the comment few lines above to have MaxPairWise instead of MaxAcross?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@EgorBo EgorBo merged commit 011b949 into dotnet:main Sep 22, 2022
@EgorBo EgorBo deleted the arm64-faster-vec-zero branch September 22, 2022 21:36
@EgorBo
Copy link
Member Author

EgorBo commented Sep 26, 2022

Seems to be also beneficial:

image

@ghost ghost locked as resolved and limited conversation to collaborators Oct 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants