Align data in Buffer.Memmove for arm64 #93214

Merged: 6 commits from align16-memmove merged into dotnet:main on Oct 9, 2023

Conversation

@EgorBo (Member) commented Oct 9, 2023

Currently, we never switch to the native memmove on ARM64 for Mac/Linux. Instead of re-enabling that path, I tried manually aligning dest; in most cases this showed small wins or the same performance as the native memmove on Apple M2 (macOS) and Ampere (Linux).
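
In rough terms the trick looks like this (a minimal sketch, not the actual Buffer.Memmove code; it assumes a 16-byte SIMD width, len >= 16 and non-overlapping buffers):

using System.Runtime.CompilerServices;

static unsafe void CopyWithAlignedDest(byte* dest, byte* src, nuint len)
{
    // Copy the first 16 bytes with an unaligned store.
    Unsafe.CopyBlockUnaligned(dest, src, 16);

    // Round dest up to the next 16-byte boundary; a few bytes near the start may get
    // copied twice, which is harmless for a forward copy of non-overlapping buffers.
    uint skip = 16 - (uint)((nuint)dest & 15);
    dest += skip; src += skip; len -= skip;

    // Main loop: the destination address of every store is now 16-byte aligned.
    while (len >= 16)
    {
        Unsafe.CopyBlockUnaligned(dest, src, 16);
        dest += 16; src += 16; len -= 16;
    }

    // Cover the remaining tail with one unaligned 16-byte block ending at the original end.
    if (len != 0)
        Unsafe.CopyBlockUnaligned(dest + len - 16, src + len - 16, 16);
}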

What's more important, it shows nice wins compared to the current impl when data is not 16-byte aligned (even 8-byte-aligned data pays a penalty), which is quite a common case: the GC only guarantees 8-byte alignment, and e.g. a String's character data is only ever 4-byte aligned. A benchmark on Apple M2:

static byte[] _src = new byte[10000];
static byte[] _dst = new byte[10000];

[Benchmark]
public void CopyTo_align4() => _src.AsSpan(4).CopyTo(_dst.AsSpan(4));

Presumably, it should also help with noise in microbenchmarks: the GC gives us 8-byte alignment, which may or may not happen to also be 16-byte aligned from run to run.

Method         Toolchain                Mean      Error    StdDev   Ratio
CopyTo_align4  /Core_Root_base/corerun  195.1 ns  0.75 ns  0.70 ns  1.68
CopyTo_align4  /Core_Root_pr/corerun    116.2 ns  0.54 ns  0.50 ns  1.00
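
To see that variability directly, one can print the 16-byte offset of the array data (an illustrative snippet, not part of the PR):

unsafe
{
    fixed (byte* p = _src)
    {
        // The GC only guarantees 8-byte alignment for objects on 64-bit, so for a byte[]
        // this typically prints 0 or 8 and can change from run to run.
        Console.WriteLine((nuint)p % 16);
    }
}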

I was using the following benchmark to get exact alignment in my tests:

// Pointers that are exactly 4 bytes past a 64-byte boundary, i.e. only 4-byte aligned.
static unsafe byte* _align4_src;
static unsafe byte* _align4_dst;

static unsafe Benchmarks()
{
    // Over-align with AlignedAlloc, then offset by 4 to get a known 4-byte alignment.
    byte* align64_src = (byte*)NativeMemory.AlignedAlloc(1_000_000, 64);
    byte* align64_dst = (byte*)NativeMemory.AlignedAlloc(1_000_000, 64);
    _align4_src = align64_src + 4;
    _align4_dst = align64_dst + 4;
}

[Benchmark]
[Arguments(...)]
public unsafe void CopyTo_align4(int len) =>
    new Span<byte>(_align4_src, len).CopyTo(new Span<byte>(_align4_dst, len));

@ghost assigned EgorBo Oct 9, 2023
@dotnet-issue-labeler bot added the needs-area-label label Oct 9, 2023
@EgorBo added the area-System.Buffers label and removed the needs-area-label label Oct 9, 2023
@ghost commented Oct 9, 2023

Tagging subscribers to this area: @dotnet/area-system-buffers
See info in area-owners.md if you want to be subscribed.

Author: EgorBo
Assignees: EgorBo
Labels: area-System.Buffers
Milestone: -

@EgorBo (Member, Author) commented Oct 9, 2023

@jkotas PTAL

@jkotas (Member) commented Oct 9, 2023

// TODO-ARM64-UNIX-OPT revisit when glibc optimized memmove is in Linux distros

Would it be better to fix this TODO instead? What is the perf of your microbenchmark with MemmoveNativeThreshold = 2048?
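
For context, the threshold gates when the managed implementation falls back to the platform memmove; roughly like the following simplified sketch (not the exact runtime code; NativeMemmove is a hypothetical stand-in for the real native-call path):

using System.Runtime.CompilerServices;

// Simplified sketch (not the exact runtime code) of how the threshold gates the native fallback.
private const nuint MemmoveNativeThreshold = 2048;   // the value suggested above

internal static void Memmove(ref byte dest, ref byte src, nuint len)
{
    if (len > MemmoveNativeThreshold)
    {
        // Large copies: defer to the platform's optimized memmove via a native call.
        NativeMemmove(ref dest, ref src, len);
        return;
    }

    // Smaller copies fall through to the managed SIMD copy loops (omitted here).
}

// Hypothetical stand-in for the real native-call path, just to keep the sketch compilable.
private static void NativeMemmove(ref byte dest, ref byte src, nuint len) =>
    Unsafe.CopyBlockUnaligned(ref dest, ref src, (uint)len);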

@EgorBo (Member, Author) commented Oct 9, 2023

// TODO-ARM64-UNIX-OPT revisit when glibc optimized memmove is in Linux distros

Would it be better to fix this TODO instead? What is the perf of your microbenchmark with MemmoveNativeThreshold = 2048?

Do you mean jumping into native for len > 512 unconditionally? I assume it was disabled for a reason, and on some platforms it's slow?

@EgorBo (Member, Author) commented Oct 9, 2023

What is the perf of your microbenchmark with MemmoveNativeThreshold = 2048

Pretty much the same; sometimes the managed impl is a nanosecond or two faster. But that's on Apple M2, where memmove is quite optimized; I am not so sure about other platforms/distros, e.g. I've seen a glibc version that didn't try to align data.

@jkotas (Member) commented Oct 9, 2023

I assume it was disabled for a reason and on some platforms it's slow?

Yes, it was slow during the initial Arm64 bring-up, when we ran on glibc without any Arm64-specific optimizations. I would expect memcpy to be optimized on all current Arm64 systems that we run on. That is what the TODO alludes to.

@EgorBo (Member, Author) commented Oct 9, 2023

Enabled this for XARCH as well, since the suggested way to align is cheaper. I am seeing wins even on XARCH now for misaligned access:

Method         Job         Toolchain                    len   Mean
CopyTo_align4  Job-VLSCTC  \Core_Root\corerun.exe       256   2.284 ns
CopyTo_align4  Job-BCQRRZ  \Core_Root_base\corerun.exe  256   2.299 ns
CopyTo_align4  Job-VLSCTC  \Core_Root\corerun.exe       512   2.986 ns
CopyTo_align4  Job-BCQRRZ  \Core_Root_base\corerun.exe  512   5.534 ns
CopyTo_align4  Job-VLSCTC  \Core_Root\corerun.exe       1024  5.717 ns
CopyTo_align4  Job-BCQRRZ  \Core_Root_base\corerun.exe  1024  8.299 ns
CopyTo_align4  Job-VLSCTC  \Core_Root\corerun.exe       2048  11.466 ns
CopyTo_align4  Job-BCQRRZ  \Core_Root_base\corerun.exe  2048  16.798 ns
CopyTo_align4  Job-VLSCTC  \Core_Root\corerun.exe       4096  26.465 ns
CopyTo_align4  Job-BCQRRZ  \Core_Root_base\corerun.exe  4096  27.076 ns

Len = 4096 is handled by the native memmove.

@EgorBo (Member, Author) commented Oct 9, 2023

For reference, here is the codegen of a 64-byte block copy:

// Empty struct whose declared size forces a 64-byte block copy on assignment.
[StructLayout(LayoutKind.Sequential, Size = 64)]
private struct Block64 { }

static void CopyBlock64(ref byte dest, ref byte src)
{
    // The JIT unrolls this single struct assignment into the SIMD moves shown below.
    Unsafe.As<byte, Block64>(ref dest) = Unsafe.As<byte, Block64>(ref src);
}

AVX512 CPU:

       vmovdqu32 zmm0, zmmword ptr [rdx]
       vmovdqu32 zmmword ptr [rcx], zmm0

AVX CPU:

       vmovdqu  ymm0, ymmword ptr [rdx]
       vmovdqu  ymmword ptr [rcx], ymm0
       vmovdqu  ymm0, ymmword ptr [rdx+0x20]
       vmovdqu  ymmword ptr [rcx+0x20], ymm0

SSE2 CPU:

       movups   xmm0, xmmword ptr [rdx]
       movups   xmmword ptr [rcx], xmm0
       movups   xmm0, xmmword ptr [rdx+0x10]
       movups   xmmword ptr [rcx+0x10], xmm0
       movups   xmm0, xmmword ptr [rdx+0x20]
       movups   xmmword ptr [rcx+0x20], xmm0
       movups   xmm0, xmmword ptr [rdx+0x30]
       movups   xmmword ptr [rcx+0x30], xmm0

ARM64 NEON CPU:

       ldp     q16, q17, [x1]
       stp     q16, q17, [x0]
       ldp     q16, q17, [x1, #0x20]
       stp     q16, q17, [x0, #0x20]

Two notes:

  1. It probably makes sense for the JIT to use at least 2 registers for BLK unrolling for better pipelining (or even 4 on some x-archs).
     Although I wasn't able to detect wins from that on XARCH, there were wins on ARM64.
  2. AVX-512 CPUs might benefit from a Block128 copy or even Block256 (my Ryzen with AVX-512 won't benefit because its AVX-512 is basically 2x AVX2); see the sketch below.
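
A possible shape for note 2, following the same pattern as Block64 above (Block128 is hypothetical and does not exist in the runtime today):

[StructLayout(LayoutKind.Sequential, Size = 128)]
private struct Block128 { }   // hypothetical wider block, mirroring Block64

static void CopyBlock128(ref byte dest, ref byte src)
{
    // On a full-width AVX-512 CPU this could lower to a pair of zmm load/store moves.
    Unsafe.As<byte, Block128>(ref dest) = Unsafe.As<byte, Block128>(ref src);
}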

@jkotas (Member) left a comment

Thanks

@EgorBo (Member, Author) commented Oct 9, 2023

Hopefully, this will make benchmarks like this more stable:

[image: benchmark results chart]

@EgorBo EgorBo merged commit f20b2b5 into dotnet:main Oct 9, 2023
@EgorBo EgorBo deleted the align16-memmove branch October 9, 2023 21:53